Abstract



Building models of language is a central task in natural language processing. Tradition- 
ally, language has been modeled with manually-constructed grammars that describe which 
strings are grammatical and which are not; however, with the recent availability of massive 
amounts of on-line text, statistically-trained models are an attractive alternative. These 
models are generally probabilistic, yielding a score reflecting sentence frequency instead of 
a binary grammaticality judgement. Probabilistic models of language are a fundamental 
tool in speech recognition for resolving acoustically ambiguous utterances. For example, 
we prefer the transcription forbear to four bear as the former string is far more frequent 
in English text. Probabilistic models also have application in optical character recognition, 
handwriting recognition, spelling correction, part-of-speech tagging, and machine transla- 
tion. 

In this thesis, we investigate three problems involving the probabilistic modeling of lan- 
guage: smoothing n-gram models, statistical grammar induction, and bilingual sentence 
alignment. These three problems employ models at three different levels of language; they 
involve word-based, constituent-based, and sentence-based models, respectively. We de- 
scribe techniques for improving the modeling of language at each of these levels, and surpass 
the performance of existing algorithms for each problem. We approach the three problems 
using three different frameworks. We relate each of these frameworks to the Bayesian 
paradigm, and show why each framework used was appropriate for the given problem. Fi- 
nally, we show how our research addresses two central issues in probabilistic modeling: the 
sparse data problem and the problem of inducing hidden structure. 
Chapter 1

Introduction 



In this thesis, we describe novel techniques for building probabilistic models of language. 
We investigate three distinct problems involving such models, and improve the state-of-the- 
art in each task. In addition, we show how the techniques developed in this work address 
two central problems in probabilistic modeling. 

In this chapter, we describe what probabilistic models of language are, and demonstrate 
how such models play an important role in many applications. We introduce the three 
problems examined and explain how they are related. Finally, we summarize the basic 
conclusions of this work. 

Chapters 2-4 describe in detail the work on each of the three tasks: smoothing n-gram
models, Bayesian grammar induction, and bilingual sentence alignment. Chapter 5 presents
the conclusions of this thesis.

1.1 Models of Language 

A model of language is simply a description of language. In its simplest form, it may just be 
a representation of the list of the sentences belonging to a language; more complex models 
may also try to describe the structure and meaning underlying sentences in a language. 
Historically, attempts to model language have fallen in two general categories. The older 
and more familiar types of models are the grammars that were first developed in the field of 
linguistics. In more recent years, shallow probabilistic models for use in applications such 
as speech recognition have gained common usage. It is these shallow probabilistic models 
that we study in this thesis. In this section, we introduce and contrast these two types of 
models. 

Traditionally, language has been modeled through grammars. In linguistics, it was 
observed that language is structured in a rather constrained hierarchical manner; there seem 
to be a fairly small number of primitive building blocks that can be combined together in 
a limited number of ways to create the widely diverse forms that are found in language. 
For example, at the very lowest level of written English we have the letter. Letters can be 
combined to form words. Words in turn can be combined to form phrases, such as a noun 
phrase, e.g., John Smith or a boat, or a prepositional phrase, e.g., above the table. These 
phrases in turn can be combined to create sentences, which in turn can be used to build
paragraphs, and so on. 

Grammars can be used to describe such hierarchical structure in a succinct manner 
(Chomsky, 1964). A grammar consists of rules that describe allowable ways of combining
structures at one level to form structures at the next higher level. For example, we may 
have a grammar rule of the form: 



Noun-Phrase -> Determiner Noun



which is generally abbreviated as 

NP -> D N 



This represents the observation that a determiner (e.g., a or the) followed by a noun can 
form a noun phrase. By combining the previous rule with the rules 

D -> a | the

N -> boat | cat | tree

describing that a determiner can be formed by the words a or the and a noun can be 
formed by the words boat, cat, or tree, we have that strings such as a boat or the tree are 
noun phrases. 
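
As a rough illustration of how these three rules generate noun phrases, the following Python sketch expands the symbol NP exhaustively. The dictionary encoding of the grammar and the function name `expand` are assumptions made for this example only; they are not part of any grammar formalism discussed in this thesis.

```python
from itertools import product

# Toy grammar from the text: NP -> D N, D -> a | the, N -> boat | cat | tree.
# The dict-of-lists encoding is an illustrative assumption.
GRAMMAR = {
    "NP": [["D", "N"]],
    "D": [["a"], ["the"]],
    "N": [["boat"], ["cat"], ["tree"]],
}

def expand(symbol):
    """Return every word sequence derivable from `symbol` (finite grammars only)."""
    if symbol not in GRAMMAR:  # a terminal, i.e. a word
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Expand each right-hand-side symbol, then combine the alternatives.
        parts = [expand(s) for s in rhs]
        for combo in product(*parts):
            results.append([w for part in combo for w in part])
    return results

noun_phrases = [" ".join(words) for words in expand("NP")]
print(noun_phrases)
```

The six strings produced are exactly the determiner-noun combinations licensed by the rules, including a boat and the tree.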

A grammar is a collection of rules like these that describe how to form high-level struc- 
tures such as sentences from low-level structures such as words. Using this representation, 
one can attempt to describe the set of all sentences in a language, and much work in lin- 
guistics is devoted to this goal, though using grammar representations much richer than the 
one described above. 

Such grammars for language have wide application, most notably in the field of
natural-language processing.¹ The field of natural-language processing deals with building
automated systems that are able to process language in some way. For example, one of the
goals of the field is natural-language understanding, or being able to build systems that can
understand human-friendly input such as What is the capital of North Dakota instead of
only computer-friendly input such as find X : capital(X, "North Dakota").

1 The term natural language is used to distinguish languages used for human communication such as
English or Urdu from languages used with machines such as Basic or Lisp. In this thesis, we use the term
language to mean only natural languages.

Grammars are useful models of language for natural language processing because they
provide insight into the structure of sentences, which aids in determining their meanings.
For example, in most systems the first step in processing a sentence is to parse the sentence
to produce a parse tree. We display the parse tree for Max threw the ball beyond the fence
in Figure 1.1. The parse tree shows what rules in the grammar need to be applied to form
the top-level structure, in this case a sentence, from the given lowest-level structures, in this
case words. In this parse tree, the top three nodes represent the applications of the rules

S -> NP VP
NP -> PN
VP -> VP PP

stating that a noun phrase followed by a verb phrase can form a sentence, a proper noun
can form a noun phrase, and a verb phrase followed by a prepositional phrase can form a
verb phrase.

Figure 1.1: Parse tree for Max threw the ball beyond the fence

Substrings of the sentence that are exactly spanned by nodes in the parse tree are 
intended to correspond to units that are relevant in determining the meaning of the sentence, 
and are called constituents. For example, the phrases Max, the ball, threw the ball, and Max 
threw the ball beyond the fence are all constituents, while the ball beyond and threw the are 
not. To give another example, the phrase the ball beyond the fence, while meaningful, is not 
a constituent because in this sentence the phrase beyond the fence describes the throw, not 
the ball. Thus, we see how grammars model not only which sentences belong to a language, 
but also the structure that underlies the meaning behind language. 

While grammars have been the prevalent tool in modeling language for a long time, it 
is generally accepted that building grammars that can handle unrestricted language is at 
least many years away. Instead, interest has shifted away from natural language under- 
standing toward applications that do not require such a rich model of language. The most 
prominent of these applications is speech recognition, the task of constructing systems that 
can automatically transcribe human speech. 

In speech recognition, a model of language is used to help disambiguate acoustically
ambiguous utterances. For example, consider the task of transcribing an acoustic signal
corresponding to the string

he is too forbearing

Possible transcriptions include the following:

$T_1$ = he is too forbearing
$T_2$ = he is two four baring

While both strings have the same pronunciation, we prefer the first transcription because 
it is the one that is more likely to occur in language. Hence, we see how models that reflect 
the frequencies of different strings in language are useful in speech recognition. 

Such models typically take forms very different from linguistic grammars. For example,
a common shallow probabilistic model is the bigram model. With a bigram model, the
probability of the sentence he is too forbearing is expressed as

$$p(\text{he})\, p(\text{is} \mid \text{he})\, p(\text{too} \mid \text{is})\, p(\text{forbearing} \mid \text{too})$$

Each probability $p(w_i \mid w_{i-1})$ attempts to reflect how often the word $w_i$ follows the word $w_{i-1}$
in language, and these probabilities are estimated by taking statistics on large amounts of
text.

There are many significant differences between the shallow models used in speech recog- 
nition and the grammatical models used in linguistics and natural language processing 
besides their disparate representations. In linguistics, one attempts to build grammars that 
correspond exactly to the set of grammatical sentences. In speech recognition, one attempts 
to model how frequently strings are spoken, regardless of grammaticality. In linguistics and 
natural language processing, one is concerned with building parse trees that reveal the 
meanings of sentences. In speech recognition, there is usually no need for structural anal- 
ysis or any other deep processing of language. In linguistics, models are not probabilistic 
as one is only trying to express a binary grammaticality judgement. In speech recognition, 
models are almost exclusively probabilistic in order to express frequencies. 

Finally, linguistic grammars have traditionally been manually constructed. A linguist 
usually designs grammars without any automated aid. In contrast, models for speech recog- 
nition are built by taking statistics on large corpora of text. Such models have a great 
number of probabilities that need to be estimated, and this estimation is only practical 
through the automated analysis of on-line text. 



1.2 Applications for Probabilistic Models 

Probabilistic models of language are not only valuable in speech recognition, but they are 
also useful in applications as diverse as spelling correction, machine translation, and part-of- 
speech tagging. These and other applications can be placed in a single common framework 
(Bahl et al., 1983), the source-channel model used in information theory (Shannon, 1948).
In this section, we explain how speech recognition can be placed in this framework, and 
then explain how other applications are just variations on this theme. 

The task of speech recognition can be framed as follows: for an acoustic signal A
corresponding to a sentence, we want to find the most probable transcription T, i.e., to find

$$T = \arg\max_T p(T \mid A)$$

However, building accurate models of $p(T \mid A)$ directly is beyond current know-how;² instead,
one applies Bayes' rule to get the relation:

$$T = \arg\max_T \frac{p(T)\, p(A \mid T)}{p(A)} = \arg\max_T p(T)\, p(A \mid T) \qquad (1.1)$$




The probability distribution p(T) is called a language model and describes how probable 
or frequent each sentence T is in language. The distribution p(A\T) is called an acous- 
tic model and describes which acoustic signals A are likely realizations of a sentence T. 
The language model p(T) corresponds to the probabilistic model of language for speech 
recognition discussed in the preceding section. 

2 Brown et al. (1995) present an explanation of why estimating $p(T \mid A)$ is difficult.

The source-channel model in information theory describes the problem of recovering
information that has been sent over a noisy channel. One has a model of the information
source, $p(I)$, and a model of the noisy channel $p(O \mid I)$ describing the likely outputs O of the
channel given an input I. (For a perfect channel, we would just have that $p(O \mid I) = 1$ for
O = I and $p(O \mid I) = 0$ otherwise.) The task is to recover the original message I sent over
the channel given the noisy output O received at the other end. This can be phrased as
finding the message I with highest probability given O, or finding

$$I = \arg\max_I p(I \mid O) = \arg\max_I \frac{p(I)\, p(O \mid I)}{p(O)} = \arg\max_I p(I)\, p(O \mid I)$$

We can see an analogy with the task of speech recognition. The information source in this
case is a person generating the text of a sentence according to the distribution $p(T)$. The
noisy channel corresponds to the process of a person converting this sentence from text to
speech according to $p(A \mid T)$. Finally, the goal is to recover the original text given the output
of this noisy channel. While it may not be intuitive to refer to a channel that converts text
to speech as a noisy channel, the mathematics are identical.

The source-channel model is a powerful paradigm because it combines the model of the
source and the model of the channel in an elegant and efficacious manner. For example,
consider the previous example of an acoustic utterance A corresponding to the sentence he
is too forbearing and the possible transcriptions:

$T_1$ = he is too forbearing
$T_2$ = he is two four baring

Here we have two transcriptions with identical pronunciations ($p(A \mid T_1) \approx p(A \mid T_2)$), but
because the former sentence is much more common ($p(T_1) \gg p(T_2)$) we get
$p(T_1)\, p(A \mid T_1) \gg p(T_2)\, p(A \mid T_2)$ and thus prefer transcription $T_1$. On the other hand, consider

$T_3$ = he is very forbearing

In this case, we have two transcriptions with very similar frequencies ($p(T_1) \approx p(T_3)$), but
because $T_1$ has a much higher acoustic score ($p(A \mid T_1) \gg p(A \mid T_3)$) we again prefer $T_1$. Thus,
we see that the source-channel model combines acoustic and language model information
effectively to prefer transcriptions that both are likely to occur in language and match the
acoustic signal well.
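
The comparison among the three transcriptions can be sketched in a few lines of Python. All scores below are invented for illustration and are not measurements from any recognizer; the names `CANDIDATES` and `decode` are likewise hypothetical. The sketch works in log space, where the product p(T)p(A|T) becomes a sum.

```python
import math

# Hypothetical log-probabilities, invented purely for illustration.
#   "lm": log p(T),     the language model score
#   "am": log p(A | T), the acoustic model score for the utterance A
CANDIDATES = {
    "he is too forbearing":  {"lm": math.log(1e-9),  "am": math.log(1e-3)},
    "he is two four baring": {"lm": math.log(1e-15), "am": math.log(1e-3)},
    "he is very forbearing": {"lm": math.log(1e-9),  "am": math.log(1e-8)},
}

def decode(candidates):
    """Return the transcription T maximizing p(T) p(A|T), computed in log space."""
    return max(candidates, key=lambda t: candidates[t]["lm"] + candidates[t]["am"])

best = decode(CANDIDATES)
print(best)
```

Under these assumed scores, the second candidate loses on the language model term and the third loses on the acoustic term, so the first is selected.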

The source-channel model can be extended to many other applications besides speech
recognition by just varying the channel model used (Brown et al., 1992b). In optical character
recognition and handwriting recognition (Hull, 1992; Srihari and Baltus, 1992), the
channel can be interpreted as converting from text to image data instead of from text to
speech, yielding the equation

$$T = \arg\max_T p(T)\, p(\text{image} \mid T).$$

In spelling correction (Kernighan et al., 1990), the channel can be interpreted as an imperfect
typist that converts perfect text T to noisy text $T_n$ with spelling mistakes, yielding

$$T = \arg\max_T p(T)\, p(T_n \mid T).$$

In machine translation (Brown et al., 1990), the channel can be interpreted as a translator
that converts text T in one language into text $T_f$ in a foreign language, yielding

$$T = \arg\max_T p(T)\, p(T_f \mid T). \qquad (1.3)$$

In each of these cases, we try to recover the original text T given the output of a noisy 
channel, whether the noisy channel outputs image data, text with spelling errors, or text 
in a foreign language. 

By varying the source model, we can extend the source-channel model to further applications.
In part-of-speech tagging (Church, 1988), one attempts to label words in sentences
with their part-of-speech. We can apply the source-channel model by taking the source
to generate part-of-speech sequences $T_{\text{pos}}$ corresponding to sentences, and taking the channel
to convert part-of-speech sequences $T_{\text{pos}}$ to sentences T that are consistent with that
part-of-speech sequence, yielding

$$T_{\text{pos}} = \arg\max_{T_{\text{pos}}} p(T_{\text{pos}})\, p(T \mid T_{\text{pos}}).$$

In this case, we try to recover the original part-of-speech sequence $T_{\text{pos}}$ given the text output
of the noisy channel. The same techniques used to build models $p(T)$ for regular text can
be used to build models $p(T_{\text{pos}})$ for part-of-speech sequences.
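
A minimal sketch of this formulation is given below. It assumes, beyond what the text states, that the source model is a bigram model over tags and that the channel generates each word from its own tag alone; the tag set and all probabilities are invented for illustration. The search is a brute-force enumeration of tag sequences, which is only feasible for very short sentences.

```python
from itertools import product

# Hypothetical toy parameters, invented for illustration.
TAGS = ["D", "N", "V"]

# Source model p(T_pos): a bigram model over tags, with "<bos>" as start symbol.
TAG_BIGRAM = {
    "<bos>": {"D": 0.6, "N": 0.3, "V": 0.1},
    "D": {"D": 0.05, "N": 0.9, "V": 0.05},
    "N": {"D": 0.1, "N": 0.3, "V": 0.6},
    "V": {"D": 0.5, "N": 0.4, "V": 0.1},
}

# Channel model p(T | T_pos): each word depends only on its own tag (an assumption).
WORD_GIVEN_TAG = {
    "D": {"the": 0.7, "a": 0.3},
    "N": {"boat": 0.4, "sails": 0.2, "cat": 0.4},
    "V": {"sails": 0.7, "boat": 0.3},
}

def score(words, tags):
    """Compute p(T_pos) * p(T | T_pos) for one candidate tag sequence."""
    prob, prev = 1.0, "<bos>"
    for word, tag in zip(words, tags):
        prob *= TAG_BIGRAM[prev][tag] * WORD_GIVEN_TAG[tag].get(word, 0.0)
        prev = tag
    return prob

def tag_sentence(words):
    """Return the tag sequence maximizing the source-channel score."""
    return max(product(TAGS, repeat=len(words)), key=lambda tags: score(words, tags))

print(tag_sentence(["the", "boat", "sails"]))
```

With these assumed parameters, the ambiguous words boat and sails are resolved to a noun and a verb respectively, because the tag bigram model favors the sequence determiner, noun, verb.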

Notice that in all of these applications it is necessary to build a source language model,
either $p(T)$ or $p(T_{\text{pos}})$. Because of the importance of this task, this topic has become its
own field, language modeling. Notice that the term language modeling is used specifically
to refer to source language models such as $p(T)$; we use the term models of language to
include more general models such as channel models or linguistic grammars. The first two
problems we examine in this thesis are concerned with improving language modeling. The
third problem is concerned with a model very similar to a channel model, in particular the
translation model $p(T_f \mid T)$ in equation (1.3).



1.3 Problem Domains 

The three problems that we have selected investigate the task of modeling language at three 
different levels: words, constituents, and sentences. 



First, we consider the problem of smoothing n-gram language models (Shannon, 1951).
Such models are dominant in language modeling, yielding the best current performance. 
In such models, the probability of a sentence is expressed through the probability of each 
word in the sentence; such models are word-based models. The construction of an n-gram 
model is straightforward, except for the issue of smoothing, a technique used when there is 
insufficient data to estimate probabilities accurately. In this thesis, we introduce two novel 
smoothing methods that outperform existing methods, and present an extensive analysis of 
previous techniques. 

Next, we consider the task of statistically inducing a grammatical language model from 
text. While it seems logical to use the grammatical models developed in linguistics for 
probabilistic language modeling, previous attempts at this have not yielded strong results. 
Instead, we attempt to statistically induce a grammar from a large corpus of text. In 
grammatical language models, the probability of a sentence is expressed through the proba- 
bilities of the constituents within the sentence, and thus can be considered constituent-based. 
Though yet to perform as well as word-based models, grammatical models offer the best 
hope for significantly improving language modeling accuracy. We introduce a novel grammar
induction algorithm based on the minimum description length principle (Rissanen,
1978) that surpasses the performance of existing algorithms.

The third problem deals with the task of bilingual sentence alignment. There exist many 
corpora that contain equivalent text in multiple languages. For example, the Hansard corpus 
contains the Canadian parliament proceedings in both English and French. Multilingual 
corpora are useful for automatically building tools for machine translation such as bilingual 
dictionaries. However, current algorithms for building such tools require the specification 
of which sentence(s) in one language translate to each sentence in the other language, 
and this information is typically not included by human translators. Bilingual sentence 
alignment is the task of automatically producing this information. This turns out to be 
a difficult problem as a sentence in one language does not always correspond to a single 
sentence in the other language. Sentence alignment can be approached within the source-channel
framework using equation (1.3) as in machine translation; however, in this work
we use a slightly different framework and express the translation model $p(T_f \mid T)$ as a joint
distribution $p(T, T_f)$. As sentence alignment is concerned only with aligning text at the
sentence level, the models used are sentence-based. We design a sentence-based translation
model that leads to an efficient and accurate alignment algorithm that outperforms previous
algorithms.

Finally, we discuss how our work on these three problems advances research in probabilistic
modeling. We compare the strategies used for building models in these three different
domains from a Bayesian perspective, and demonstrate why different strategies are appro- 
priate for different domains. In addition, we show how the techniques we have developed 
address two central issues in probabilistic modeling. 



1.4 Bayesian Modeling 

The Bayesian framework is an elegant and very general framework for probabilistic model- 
ing. We explain the Bayesian framework through an example: consider the task of inducing 
a grammar G from some data or observations O. In the Bayesian framework, one attempts 
to find the grammar G that has the highest probability given the data O, i.e., to find 

$$G = \arg\max_G p(G \mid O).$$

As it is difficult to estimate $p(G \mid O)$ directly, we apply Bayes' rule to get

$$G = \arg\max_G \frac{p(O \mid G)\, p(G)}{p(O)} = \arg\max_G p(O \mid G)\, p(G). \qquad (1.4)$$

The term $p(O \mid G)$ describes the probability assigned to the data by the grammar, and is a
measure of how well the grammar models the data. The term $p(G)$ describes our a priori
notion of how likely a given grammar G is.³ This division between model accuracy and the
prior belief of model likelihood is a natural way of modularizing the grammar induction 
problem. 
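
To make the division between the two terms concrete, the sketch below selects among three candidate grammars by maximizing log p(O|G) + log p(G). Everything in it is hypothetical: the candidate names, the log-likelihood values, and in particular the prior, which is simply assumed to penalize each rule by a fixed amount. It is not the prior used later in this thesis. The term p(O) is omitted because it is the same for every G.

```python
# Hypothetical candidates: each has a log-likelihood log p(O|G) for some fixed
# data O and a size (number of rules). All numbers are invented.
CANDIDATES = {
    "small_grammar":  {"log_likelihood": -120.0, "num_rules": 5},
    "medium_grammar": {"log_likelihood": -100.0, "num_rules": 12},
    "huge_grammar":   {"log_likelihood": -95.0,  "num_rules": 60},
}

def log_prior(num_rules, penalty_per_rule=1.0):
    """Assumed prior: log p(G) = -penalty_per_rule * num_rules, up to a constant."""
    return -penalty_per_rule * num_rules

def map_grammar(candidates):
    """Return the grammar maximizing log p(O|G) + log p(G)."""
    return max(
        candidates,
        key=lambda g: candidates[g]["log_likelihood"]
        + log_prior(candidates[g]["num_rules"]),
    )

def ml_grammar(candidates):
    """Return the grammar maximizing log p(O|G) alone, ignoring the prior."""
    return max(candidates, key=lambda g: candidates[g]["log_likelihood"])

print(map_grammar(CANDIDATES), ml_grammar(CANDIDATES))
```

Under these assumed numbers, the likelihood alone favors the largest grammar, while the posterior favors the medium-sized one: the prior term keeps the search from rewarding grammars that fit the data only by growing.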

While each of the three problems we investigated can be addressed within the Bayesian 
framework, we instead selected three dissimilar approaches. For the grammar induction 
problem, we apply the Bayesian framework in a straightforward manner. For the sentence 
alignment problem, we use ad hoc methods that can be loosely interpreted as Bayesian in 
nature. While the Bayesian framework is well-suited to sentence alignment, the use of ad 
hoc methods greatly simplified implementation at little or no cost in terms of performance. 
Finally, for smoothing n-gram models we use non-Bayesian methods. It is unclear how to 
select a prior distribution over smoothed n-gram models, and we have found that it is more 
effective to optimize performance directly than to optimize performance through examining 
different prior distributions. We conclude that while the Bayesian framework is elegant and 
general, in practice less elegant methods are often effective. 



3 While equation (1.4) is very similar to equation (1.1) describing the source-channel model for speech
recognition, this equation differs in that $p(G)$ is called a prior distribution and is built using a priori
information. The analogous term $p(T)$ in equation (1.1) is called a language model and is built using
modeling techniques.
1.5 Sparse Data and Inducing Hidden Structure



Two issues that form a recurring theme in probabilistic modeling are the sparse data prob- 
lem and the problem of inducing hidden structure; these are perhaps the two most important 
issues in probabilistic modeling today. 

The sparse data problem refers to the situation when there is insufficient data to train 
one's model accurately. This problem is ubiquitous in statistical modeling; the models that 
perform well tend to be very large and thus require a great deal of data to train. There 
are two main approaches to addressing this problem. First, one can use the technique 
of smoothing, which describes methods for accurately estimating probabilities in the pres- 
ence of sparse data. Secondly, one can consider techniques for building compact models. 
Compact models have fewer parameters to train and thus require less data. 

The problem of inducing hidden structure describes the task of building models that 
express structure not overtly present in the training data. To give an example, consider the 
bigram model mentioned earlier, where the probability of the sentence he is too forbearing 
is expressed as 

$$p(\text{he})\, p(\text{is} \mid \text{he})\, p(\text{too} \mid \text{is})\, p(\text{forbearing} \mid \text{too})$$

Expressing the probability of a sentence in terms of the probability of each word in the sen- 
tence conditioned on the immediately preceding word does not seem particularly felicitous. 
Intuitively, it seems likely that by capturing the structure underlying language as is done 
in linguistics, one may be able to build superior models. We call this structure hidden as 
it is not explicitly demarcated in text.⁴ To date, bigram models and similar models greatly
outperform models that attempt to model hidden structure, but methods that induce hid- 
den structure offer perhaps the best hope for producing models that significantly improve 
the current state-of-the-art. 

In this thesis, we present several techniques that help address these two central issues 
in probabilistic modeling. For the sparse data problem, we give novel techniques for both 
smoothing and for constructing compact models. In addition, we present novel techniques 
for inducing hidden structure that are not only effective but efficient as well. 



4 There is some data that has been manually annotated with this information, e.g., the Penn Treebank. 
However, manual annotation is expensive and thus only a limited amount of such data is available. 
Chapter 2

Smoothing n-Gram Models 



In this chapter, we describe work on the task of smoothing n-gram models (Chen and
Goodman, 1996).¹ Of the three structural levels at which we model language in this thesis,
this represents work at the word level. We introduce two novel smoothing techniques 
that significantly outperform all existing techniques on trigram models, and that perform 
competitively on bigram models. We present an extensive empirical comparison of existing 
smoothing techniques, which was previously lacking in the literature. 



2.1 Introduction 

As mentioned in Chapter 1, language models are a staple in many domains including speech
recognition, optical character recognition, handwriting recognition, machine translation,
and spelling correction. A language model is a probability distribution $p(s)$ over strings s
that attempts to reflect how frequently a string s occurs as a sentence. For example, for a
language model describing spoken language, we might have $p(\text{hello}) \approx 0.01$ since perhaps
one out of every hundred sentences a person speaks is hello. On the other hand, we would
have $p(\text{chicken funky overload ketchup}) \approx 0$ and $p(\text{asbestos gallops gallantly}) \approx 0$ since it
is extremely unlikely anyone would utter either string. Notice that unlike in linguistics,
grammaticality is irrelevant in language modeling. Even though the string asbestos gallops
gallantly is grammatical, we still assign it a near-zero probability. Also, notice that in
language modeling we are only interested in the frequency with which a string occurs as
a complete sentence. For instance, we have $p(\text{you today}) \approx 0$ even though the string you
today occurs frequently in spoken language, as in how are you today.

By far the most widely used language models are n-gram language models. We introduce
these models by considering the case n = 2; these models are called bigram models. First,
we notice that for a sentence s composed of the words $w_1 \cdots w_l$, without loss of generality
we can express $p(s)$ as

$$p(s) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2) \cdots p(w_l \mid w_1 \cdots w_{l-1}) = \prod_{i=1}^{l} p(w_i \mid w_1 \cdots w_{i-1})$$

1 This research was joint work with Joshua Goodman. 
In bigram models, we make the approximation that the probability of a word only depends
on the identity of the immediately preceding word, giving us

$$p(s) = \prod_{i=1}^{l} p(w_i \mid w_1 \cdots w_{i-1}) \approx \prod_{i=1}^{l} p(w_i \mid w_{i-1}) \qquad (2.1)$$

To make $p(w_i \mid w_{i-1})$ meaningful for i = 1, we can pad the beginning of the sentence with
a distinguished token $w_{\text{bos}}$; that is, we pretend $w_0$ is $w_{\text{bos}}$. In addition, to make the sum
of the probabilities of all strings $\sum_s p(s)$ equal 1, it is necessary to place a distinguished
token $w_{\text{eos}}$ at the end of sentences and to include this in the product in equation (2.1).² For
example, to calculate p(John read a book) we would take

$$p(\text{John read a book}) = p(\text{John} \mid w_{\text{bos}})\, p(\text{read} \mid \text{John})\, p(\text{a} \mid \text{read})\, p(\text{book} \mid \text{a})\, p(w_{\text{eos}} \mid \text{book})$$
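
The role of the end-of-sentence token in normalization can be checked numerically. The sketch below uses a two-word vocabulary and invented bigram probabilities (the tables `WITH_EOS` and `WITHOUT_EOS` and the function names are assumptions of this example). It sums p(s) over all strings of a given length, once with the end token included in the product and once with the end token removed and each row renormalized.

```python
from itertools import product

BOS, EOS = "<bos>", "<eos>"  # stand-ins for w_bos and w_eos
VOCAB = ["a", "b"]

# Hypothetical bigram probabilities; every row sums to 1.
WITH_EOS = {
    BOS: {"a": 0.6, "b": 0.4},
    "a": {"a": 0.2, "b": 0.3, EOS: 0.5},
    "b": {"a": 0.1, "b": 0.4, EOS: 0.5},
}

# The same model with the end token removed and each row renormalized.
WITHOUT_EOS = {
    prev: {w: p / (1.0 - row.get(EOS, 0.0)) for w, p in row.items() if w != EOS}
    for prev, row in WITH_EOS.items()
}

def sentence_prob(words, table, use_eos):
    """Bigram probability of a word sequence, optionally including w_eos."""
    prob, prev = 1.0, BOS
    for w in words:
        prob *= table[prev].get(w, 0.0)
        prev = w
    if use_eos:
        prob *= table[prev].get(EOS, 0.0)
    return prob

def total_mass(length, table, use_eos):
    """Sum of p(s) over all strings s with exactly `length` words."""
    return sum(
        sentence_prob(s, table, use_eos) for s in product(VOCAB, repeat=length)
    )

for k in (1, 2, 3):
    print(k, total_mass(k, WITHOUT_EOS, False), total_mass(k, WITH_EOS, True))
```

With these invented numbers, the mass without the end token stays at 1 for every length, so the total over all lengths diverges. With the end token, the per-length masses are 1/2, 1/4, 1/8, and so on, which sum to 1.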

To estimate $p(w_i \mid w_{i-1})$, the frequency with which the word $w_i$ occurs given that the last
word is $w_{i-1}$, we can simply count how often the bigram $w_{i-1} w_i$ occurs in some text and
normalize; that is, we can take

$$p(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{\sum_{w_i} c(w_{i-1} w_i)} \qquad (2.2)$$

where $c(w_{i-1} w_i)$ denotes the number of times the bigram $w_{i-1} w_i$ occurs in the given
text.³ The text available for building a model is called training data. For n-gram models,
the amount of training data used is typically many millions of words. The estimate
for $p(w_i \mid w_{i-1})$ given in equation (2.2) is called the maximum likelihood (ML) estimate of
$p(w_i \mid w_{i-1})$, because this assignment of probabilities yields the bigram model that assigns
the highest probability to the training data of all possible bigram models.⁴

2 Without this, consider the total probability associated with one-word strings. We have

$$\sum_{s = w_1} p(s) = \sum_{w_1} p(w_1 \mid w_{\text{bos}}) = 1$$

That is, the probabilities associated with one-word strings alone sum to 1. Similarly, without this device we
would have that the total probability of strings of exactly length k is 1 for all k > 0, giving us

$$\sum_s p(s) = \sum_{l=1}^{\infty} \sum_{l(s) = l} p(s) = \sum_{l=1}^{\infty} 1 = \infty$$

3 The expression $\sum_{w_i} c(w_{i-1} w_i)$ in equation (2.2) can also be expressed as simply $c(w_{i-1})$, the number
of times the word $w_{i-1}$ occurs. However, we generally use the summation form as this highlights the fact
that this expression is used for normalization.

4 The probability of some data given a language model is just the product of the probabilities of each
sentence in the data. For training data S composed of the sentences $(s_1, \ldots, s_{l_S})$, we have $p(S) = \prod_{i=1}^{l_S} p(s_i)$.

For n-gram models where n > 2, instead of conditioning the probability of a word on
the identity of just the preceding word, we condition this probability on the identity of the
last n - 1 words. Generalizing equation (2.1) to n > 2, we get

$$p(s) = \prod_{i=1}^{l} p(w_i \mid w_{i-n+1}^{i-1})$$

where $w_i^j$ denotes the words $w_i \cdots w_j$.⁵ To estimate the probabilities $p(w_i \mid w_{i-n+1}^{i-1})$, the
analogous equation to equation (2.2) is

$$p(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i})}{\sum_{w_i} c(w_{i-n+1}^{i})} \qquad (2.3)$$

In practice, the largest n in wide use is n = 3; this model is referred to as a trigram model.

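Let us consider a small example. Let our training data S be composed of the three
sentences

(John read Moby Dick, Mary read a different book, she read a book by Cher)

and let us calculate p(John read a book) for the maximum likelihood bigram model. We
have

$$p(\text{John} \mid w_{\text{bos}}) = \frac{c(w_{\text{bos}}\ \text{John})}{\sum_w c(w_{\text{bos}}\ w)} = \frac{1}{3}$$

$$p(\text{read} \mid \text{John}) = \frac{c(\text{John read})}{\sum_w c(\text{John}\ w)} = \frac{1}{1}$$

$$p(\text{a} \mid \text{read}) = \frac{c(\text{read a})}{\sum_w c(\text{read}\ w)} = \frac{2}{3}$$

$$p(\text{book} \mid \text{a}) = \frac{c(\text{a book})}{\sum_w c(\text{a}\ w)} = \frac{1}{2}$$

$$p(w_{\text{eos}} \mid \text{book}) = \frac{c(\text{book}\ w_{\text{eos}})}{\sum_w c(\text{book}\ w)} = \frac{1}{2}$$

giving us

$$p(\text{John read a book}) = p(\text{John} \mid w_{\text{bos}})\, p(\text{read} \mid \text{John})\, p(\text{a} \mid \text{read})\, p(\text{book} \mid \text{a})\, p(w_{\text{eos}} \mid \text{book}) = \frac{1}{3} \times 1 \times \frac{2}{3} \times \frac{1}{2} \times \frac{1}{2} \approx 0.06$$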
Now, consider the sentence Moby read a book. We have 

c(Moby read) 



p{read\Moby) 



J2 W c(Moby w) 1 



so we will get $p(\text{Moby read a book}) = 0$. Obviously, this is an underestimate for the
probability $p(\text{Moby read a book})$ as there is some probability that the sentence occurs. To show
why it is important that this probability should be given a nonzero value, we turn to the
primary application for language models, speech recognition. As described in Chapter 1,
in speech recognition one attempts to find the sentence $s$ that maximizes $p(A \mid s)p(s)$ for a
given acoustic signal $A$. If $p(s)$ is zero, then $p(A \mid s)p(s)$ will be zero and the string $s$ will
never be considered as a transcription, regardless of how unambiguous the acoustic signal
is. Thus, whenever a string $s$ such that $p(s) = 0$ occurs during a speech recognition task,
an error will be made. Assigning all strings a nonzero probability helps prevent errors in
speech recognition.

5 Instead of padding the beginning of sentences with a single $w_{bos}$ as in a bigram model, we need to pad
sentences with $n - 1$ $w_{bos}$'s for an n-gram model.
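The maximum likelihood calculation in the example above can be checked with a short sketch. This is an illustration only, not the thesis implementation: the token strings `<bos>` and `<eos>` and all function names are assumptions introduced here, and exact fractions are used so the values can be compared directly with the worked example.

```python
from collections import Counter
from fractions import Fraction

BOS, EOS = "<bos>", "<eos>"   # hypothetical stand-ins for w_bos and w_eos

TRAIN = [
    "John read Moby Dick",
    "Mary read a different book",
    "she read a book by Cher",
]

def bigram_counts(sentences):
    """Count bigrams after padding each sentence with BOS and EOS."""
    counts = Counter()
    for s in sentences:
        words = [BOS] + s.split() + [EOS]
        counts.update(zip(words, words[1:]))
    return counts

def p_ml(counts, prev, word):
    """Maximum likelihood bigram estimate, as in equation (2.2)."""
    total = sum(c for (u, _), c in counts.items() if u == prev)
    if total == 0:            # unseen history: no estimate available, return 0
        return Fraction(0)
    return Fraction(counts[(prev, word)], total)

def sentence_prob(counts, sentence):
    """Product of the bigram probabilities of a padded sentence."""
    words = [BOS] + sentence.split() + [EOS]
    prob = Fraction(1)
    for prev, word in zip(words, words[1:]):
        prob *= p_ml(counts, prev, word)
    return prob

counts = bigram_counts(TRAIN)
print(sentence_prob(counts, "John read a book"))   # 1/18, about 0.06
print(sentence_prob(counts, "Moby read a book"))   # 0
```

The second sentence receives probability zero because at least one of its bigrams never occurs in the training data, which is the problem that smoothing addresses.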

Smoothing is used to address this problem. The term smoothing describes techniques
for adjusting the maximum likelihood estimate of probabilities (as in equations (2.2) and
(2.3)) to produce more accurate probabilities. Typically, smoothing methods prevent any
probability from being zero, but they also attempt to improve the accuracy of the model as
a whole. Whenever a probability is estimated from few counts, smoothing has the potential
to significantly improve estimation. For instance, from the three occurrences of the word
read in the above example we have that the maximum likelihood estimate of the probability
that the word a follows the word read is 2/3. As this estimate is based on three counts,
we do not have great confidence in this estimate and intuitively, it is a gross overestimate.
Smoothing would typically greatly lower this estimate.

The name smoothing comes from the fact that these techniques tend to make distri- 
butions more uniform, which can be viewed as making them smoother. Typically, very 
low probabilities such as zero probabilities are adjusted upward, and high probabilities are 
adjusted downward. 

To give an example, one simple smoothing technique is to pretend each bigram occurs
once more than it actually does (Lidstone, 1920; Johnson, 1932; Jeffreys, 1948), yielding

$$p_{+1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}w_i) + 1}{\sum_{w_i} [c(w_{i-1}w_i) + 1]} = \frac{c(w_{i-1}w_i) + 1}{\sum_{w_i} c(w_{i-1}w_i) + |V|} \tag{2.4}$$

where $V$ is the vocabulary, the set of all words being considered.⁶ Let us reconsider the
previous example using this new distribution, and let us take our vocabulary $V$ to be the
set of all words occurring in the training data $S$, so that we have $|V| = 11$.
For the sentence John read a book, we have 

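$$\begin{aligned}
p(\text{John read a book}) &= p(\text{John} \mid w_{bos})\, p(\text{read} \mid \text{John})\, p(\text{a} \mid \text{read})\, p(\text{book} \mid \text{a})\, p(w_{eos} \mid \text{book}) \\
&= \frac{2}{14} \times \frac{2}{12} \times \frac{3}{14} \times \frac{2}{13} \times \frac{2}{13} \approx 0.0001
\end{aligned}$$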

In other words, we estimate that the sentence John read a book occurs about once every ten
thousand sentences. This is much more reasonable than the maximum likelihood estimate
of 0.06, or about once every seventeen sentences. For the sentence Moby read a book, we
have

$$\begin{aligned}
p(\text{Moby read a book}) &= p(\text{Moby} \mid w_{bos})\, p(\text{read} \mid \text{Moby})\, p(\text{a} \mid \text{read})\, p(\text{book} \mid \text{a})\, p(w_{eos} \mid \text{book}) \\
&= \frac{1}{14} \times \frac{1}{12} \times \frac{3}{14} \times \frac{2}{13} \times \frac{2}{13} \approx 0.00003
\end{aligned}$$

Again, this is more reasonable than the zero probability assigned by the maximum likelihood
model.

6 Notice that if $V$ is taken to be infinite, the denominator is infinite and all probabilities are set to zero.
In practice, vocabularies are typically fixed to be tens of thousands of words. All words not in the vocabulary
are mapped to a single distinguished word, usually called the unknown word.
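The plus-one estimates above can be reproduced with a small sketch of equation (2.4). It is illustrative only: the padding tokens and function names are assumptions, and the vocabulary is taken to be the 11 training words (excluding the padding tokens), matching $|V| = 11$ in the text.

```python
from collections import Counter
from fractions import Fraction

BOS, EOS = "<bos>", "<eos>"   # hypothetical stand-ins for w_bos and w_eos

TRAIN = [
    "John read Moby Dick",
    "Mary read a different book",
    "she read a book by Cher",
]

def train(sentences):
    """Collect bigram counts, per-history totals and the vocabulary."""
    bigrams, history_totals, vocab = Counter(), Counter(), set()
    for s in sentences:
        tokens = s.split()
        vocab.update(tokens)                  # padding tokens are not in V
        words = [BOS] + tokens + [EOS]
        for prev, word in zip(words, words[1:]):
            bigrams[(prev, word)] += 1
            history_totals[prev] += 1
    return bigrams, history_totals, vocab

def p_plus_one(model, prev, word):
    """Equation (2.4): add one to every bigram count."""
    bigrams, history_totals, vocab = model
    return Fraction(bigrams[(prev, word)] + 1, history_totals[prev] + len(vocab))

def sentence_prob(model, sentence):
    words = [BOS] + sentence.split() + [EOS]
    prob = Fraction(1)
    for prev, word in zip(words, words[1:]):
        prob *= p_plus_one(model, prev, word)
    return prob

model = train(TRAIN)
print(float(sentence_prob(model, "John read a book")))   # about 0.00012
print(float(sentence_prob(model, "Moby read a book")))   # about 0.00003
```

Both sentences now receive nonzero probability, and the sentence containing the unseen bigram is scored lower than the one built only from seen bigrams.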

While smoothing is a central issue in language modeling, the literature lacks a definitive
comparison between the many existing techniques. Previous studies (Nadas, 1984; Katz,
1987; Church and Gale, 1991; MacKay and Peto, 1995) only compare a small number of
methods (typically two) on a single corpus and using a single training data size. As a result,
it is currently difficult for a researcher to intelligently choose between smoothing schemes.

In this work, we carry out an extensive empirical comparison of the most widely used 
smoothing techniques, including those described by Jelinek and Mercer (1980), Katz (1987),
and Church and Gale (1991). We carry out experiments over many training data sizes on 
varied corpora using both bigram and trigram models. We demonstrate that the relative 
performance of techniques depends greatly on training data size and n-gram order. For 
example, for bigram models produced from large training sets Church-Gale smoothing has 
superior performance, while Katz smoothing performs best on bigram models produced 
from smaller data. For the methods with parameters that can be tuned to improve per- 
formance, we perform an automated search for optimal values and show that sub-optimal 
parameter selection can significantly decrease performance. To our knowledge, this is the 
first smoothing work that systematically investigates any of these issues. 

In addition, we introduce two novel smoothing techniques: the first belonging to the 
class of smoothing models described by Jelinek and Mercer, the second a very simple linear 
interpolation method. While being relatively simple to implement, we show that these 
methods yield good performance in bigram models and superior performance in trigram 
models. 

We take the performance of a method $m$ to be its cross-entropy on test data

$$-\frac{1}{N_T} \sum_{i=1}^{l_T} \log_2 p_m(t_i)$$

where $p_m(t_i)$ denotes the probability that the language model produced with method $m$
assigns to sentence $t_i$, and where the test data $T$ is composed of sentences
$(t_1, \ldots, t_{l_T})$ and contains a total of $N_T$ words. The cross-entropy, which is
sometimes referred to as just entropy, is inversely related to the average
probability a model assigns to sentences in the test data, and it is generally assumed that
lower entropy correlates with better performance in applications. Sometimes the entropy
is reported in terms of a perplexity value; an entropy of $H$ is equivalent to a perplexity of
$2^H$. The perplexity can be interpreted as the inverse ($1/p$) of the average probability ($p$) with
which words are predicted by a model. Typical perplexities yielded by n-gram models on
English text range from about 50 to several hundred, depending on the type of text.
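As a rough illustration of the evaluation measure, the sketch below computes the cross-entropy and the corresponding perplexity from a list of sentence probabilities. The function names and the toy numbers are assumptions made for this example; what is counted in $N_T$ is taken as given by the caller.

```python
import math

def cross_entropy(sentence_probs, num_words):
    """-(1/N_T) * sum_i log2 p_m(t_i), in bits per word."""
    return -sum(math.log2(p) for p in sentence_probs) / num_words

def perplexity(entropy_bits):
    """An entropy of H bits corresponds to a perplexity of 2^H."""
    return 2.0 ** entropy_bits

# Toy test set: two sentences, 8 words in total, each with probability 2^-8.
H = cross_entropy([2 ** -8, 2 ** -8], num_words=8)
print(H, perplexity(H))   # 2.0 4.0
```

In this toy case the model predicts each word with an average probability of 1/4, which corresponds to a perplexity of 4.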

In addition to evaluating the overall performance of various smoothing techniques, we
provide more detailed analyses of performance. We examine the performance of different
algorithms on n-grams with particular numbers of counts in the training data; we find 
that Katz and Church-Gale smoothing most accurately smooth n-grams with large counts, 
while our two novel methods are best for small counts. We calculate the relative impact 
on performance of small counts and large counts for different training set sizes and n-gram 
orders, and use this data to explain the variation in performance of different algorithms in 
different situations. Finally, we discuss several miscellaneous points including how Church- 
Gale smoothing compares to linear interpolation, and how deleted interpolation compares 
with held-out interpolation. 



2.2 Previous Work 



2.2.1 Additive Smoothing 

The simplest type of smoothing used in practice is additive smoothing (Lidstone, 1920;
Johnson, 1932; Jeffreys, 1948), which is just a generalization of the smoothing given in
equation (2.4). Instead of pretending each n-gram occurs once more than it does, we
pretend it occurs $\delta$ times more than it does, where typically $0 < \delta \le 1$, i.e.,

$$p_{add}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) + \delta}{\sum_{w_i} c(w_{i-n+1}^{i}) + \delta |V|} \tag{2.5}$$

Lidstone and Jeffreys advocate taking $\delta = 1$. Gale and Church (1990; 1994) have argued
that this method generally performs poorly.



2.2.2 Good-Turing Estimate 

The Good-Turing estimate (Good, 1953) is central to many smoothing techniques. The
Good-Turing estimate states that for any n-gram that occurs $r$ times, we should pretend
that it occurs $r^*$ times where

$$r^* = (r + 1) \frac{n_{r+1}}{n_r} \tag{2.6}$$

and where $n_r$ is the number of n-grams that occur exactly $r$ times in the training data. To
convert this count to a probability, we just normalize: for an n-gram $\alpha$ with $r$ counts, we
take

$$p_{GT}(\alpha) = \frac{r^*}{N} \tag{2.7}$$

where $N$ is the total number of counts in the distribution.

To derive this estimate, assume that there are a total of $s$ different n-grams $\alpha_1, \ldots, \alpha_s$
and that their true probabilities or frequencies are $p_1, \ldots, p_s$, respectively. Let $c(\alpha_i)$ denote
the number of times the n-gram $\alpha_i$ occurs in the given training data. Now, we wish to
calculate the true probability of an n-gram $\alpha_i$ that occurs $r$ times; we can interpret this as
calculating $E(p_i \mid c(\alpha_i) = r)$, where $E$ denotes expected value. This can be expanded as

$$E(p_i \mid c(\alpha_i) = r) = \sum_{j=1}^{s} p(i = j \mid c(\alpha_i) = r)\, p_j \tag{2.8}$$

The probability $p(i = j \mid c(\alpha_i) = r)$ is the probability that a randomly selected n-gram $\alpha_i$
with $r$ counts is actually the $j$th n-gram $\alpha_j$. This is just

$$p(i = j \mid c(\alpha_i) = r) = \frac{p(c(\alpha_j) = r)}{\sum_{k=1}^{s} p(c(\alpha_k) = r)} = \frac{\binom{N}{r} p_j^r (1 - p_j)^{N-r}}{\sum_{k=1}^{s} \binom{N}{r} p_k^r (1 - p_k)^{N-r}} = \frac{p_j^r (1 - p_j)^{N-r}}{\sum_{k=1}^{s} p_k^r (1 - p_k)^{N-r}}$$

where $N = \sum_{j=1}^{s} c(\alpha_j)$, the total number of counts. Substituting this into equation (2.8),
we get

$$E(p_i \mid c(\alpha_i) = r) = \frac{\sum_{j=1}^{s} p_j^{r+1} (1 - p_j)^{N-r}}{\sum_{j=1}^{s} p_j^{r} (1 - p_j)^{N-r}} \tag{2.9}$$

Then, consider $E_N(n_r)$, the expected number of n-grams with exactly $r$ counts given
that there are a total of $N$ counts. This is equal to the sum of the probability that each
n-gram has exactly $r$ counts:

$$E_N(n_r) = \sum_{i=1}^{s} p(c(\alpha_i) = r) = \sum_{i=1}^{s} \binom{N}{r} p_i^r (1 - p_i)^{N-r}$$

We can substitute this expression into equation (2.9) to yield

$$E(p_i \mid c(\alpha_i) = r) = \frac{r + 1}{N + 1} \frac{E_{N+1}(n_{r+1})}{E_N(n_r)}$$

This is an estimate for the expected probability of an n-gram $\alpha_i$ with $r$ counts; to express
this in terms of a corrected count $r^*$ we use equation (2.7) to get

$$r^* = N p(\alpha_i) = N \frac{r + 1}{N + 1} \frac{E_{N+1}(n_{r+1})}{E_N(n_r)} \approx (r + 1) \frac{n_{r+1}}{n_r}$$

Notice that the approximations $E_N(n_r) \approx n_r$ and $\frac{N}{N+1} E_{N+1}(n_{r+1}) \approx n_{r+1}$ are used in the
above equation. In other words, we use the empirical values of $n_r$ to estimate what their
expected values are.

The Good-Turing estimate yields absurd values when $n_r = 0$; it is generally necessary
to "smooth" the $n_r$, e.g., to adjust the $n_r$ so that they are all above zero. Recently,
Gale and Sampson (1995) have proposed a simple and effective algorithm for smoothing
these values.

In practice, the Good-Turing estimate is not used by itself for n-gram smoothing, because
it does not include the interpolation of higher-order models with lower-order models
necessary for good performance, as discussed in the next section. However, it is used as a
tool in several smoothing techniques.
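The corrected counts of equation (2.6) are straightforward to compute from count-of-count statistics. The following is a minimal sketch with hypothetical names and toy data; it does not smooth the $n_r$, and it simply omits any $r$ for which $n_{r+1} = 0$, where the raw estimate would collapse to zero.

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Return {r: r*} with r* = (r + 1) * n_{r+1} / n_r, as in equation (2.6)."""
    n = Counter(ngram_counts.values())      # n[r] = number of n-grams seen r times
    corrected = {}
    for r in sorted(n):
        if n[r + 1] > 0:                    # unsmoothed n_r: skip degenerate cases
            corrected[r] = (r + 1) * n[r + 1] / n[r]
    return corrected

# Toy counts: three n-grams seen once, two seen twice, one seen three times.
toy = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3}
r_star = good_turing_counts(toy)
print(r_star)                               # r = 1 maps to 4/3, r = 2 maps to 1.5

# Total probability reserved for unseen n-grams: n_1 / N.
n1 = sum(1 for c in toy.values() if c == 1)
print(n1 / sum(toy.values()))               # 0.3
```

The missing entry for $r = 3$ in the toy output illustrates why the $n_r$ generally need to be smoothed before the estimate is usable for larger counts.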
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2.2.3 Jelinek-Mercer Smoothing 

Consider the case of constructing a bigram model on training data where we have that
$c(\text{burnish the}) = 0$ and $c(\text{burnish thou}) = 0$. Then, according to both additive smoothing
and the Good-Turing estimate, we will have

$$p(\text{the} \mid \text{burnish}) = p(\text{thou} \mid \text{burnish})$$

However, intuitively we should have

$$p(\text{the} \mid \text{burnish}) > p(\text{thou} \mid \text{burnish})$$

because the word the is much more common than the word thou. To capture this behavior,
we can interpolate the bigram model with a unigram model. A unigram model is just a
1-gram model, which corresponds to conditioning the probability of a word on no other
words. That is, the unigram probability of a word just reflects its frequency in text. For
example, the maximum likelihood unigram model is

$$p_{ML}(w_i) = \frac{c(w_i)}{\sum_{w_i} c(w_i)}$$

We can linearly interpolate a bigram model and unigram model as follows:

$$p_{interp}(w_i \mid w_{i-1}) = \lambda\, p_{ML}(w_i \mid w_{i-1}) + (1 - \lambda)\, p_{ML}(w_i)$$

where $0 \le \lambda \le 1$. Because $p_{ML}(\text{the} \mid \text{burnish}) = p_{ML}(\text{thou} \mid \text{burnish}) = 0$ while
$p_{ML}(\text{the}) \gg p_{ML}(\text{thou})$, we will have that

$$p_{interp}(\text{the} \mid \text{burnish}) > p_{interp}(\text{thou} \mid \text{burnish})$$

as desired.

In general, it is useful to linearly interpolate higher-order n-gram models with lower-order
n-gram models, because when there is insufficient data to estimate a probability in the
higher-order model, the lower-order model can often provide useful information. A general
class of interpolated models is described by Jelinek and Mercer (1980). An elegant way of
performing this interpolation is given by Brown et al. (1992a) as follows

$$p_{interp}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}}\, p_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{interp}(w_i \mid w_{i-n+2}^{i-1})$$

The nth-order smoothed model is defined recursively as a linear interpolation between the
nth-order maximum likelihood model and the (n-1)th-order smoothed model. To end
the recursion, we can take the smoothed 1st-order model to be the maximum likelihood
distribution, or we can take the smoothed 0th-order model to be the uniform distribution

$$p_{unif}(w_i) = \frac{1}{|V|}$$

Given fixed $p_{ML}$, it is possible to search efficiently for the $\lambda_{w_{i-n+1}^{i-1}}$ that maximize the
probability of some data using the Baum-Welch algorithm (Baum, 1972). To yield meaningful
results, the data used to estimate the $\lambda_{w_{i-n+1}^{i-1}}$ need to be disjoint from the data used
to calculate the $p_{ML}$.⁷ In held-out interpolation, one reserves a section of the training data
for this purpose. Alternatively, Jelinek and Mercer describe a technique called deleted
interpolation where different parts of the training data rotate in training either the $p_{ML}$ or
the $\lambda_{w_{i-n+1}^{i-1}}$; the results are then averaged.

Training each parameter $\lambda_{w_{i-n+1}^{i-1}}$ independently is not generally felicitous; we would
need an enormous amount of data to train so many independent parameters accurately.
Instead, Jelinek and Mercer suggest dividing the $\lambda_{w_{i-n+1}^{i-1}}$ into a moderate number of sets,
and constraining all $\lambda_{w_{i-n+1}^{i-1}}$ in the same set to be equal, thereby reducing the number of
independent parameters to be estimated. Ideally, we should tie together those $\lambda_{w_{i-n+1}^{i-1}}$ that
we have an a priori reason to believe should have similar values. Bahl et al. (1983) suggest
choosing these sets of $\lambda_{w_{i-n+1}^{i-1}}$ according to $\sum_{w_i} c(w_{i-n+1}^{i})$, the total number of counts in
the higher-order distribution being interpolated. The general idea is that this total count
should correlate with how strongly the higher-order distribution should be weighted. That
is, the higher this count the higher $\lambda_{w_{i-n+1}^{i-1}}$ should be. Distributions with the same number
of total counts should have similar interpolation constants. More specifically, Bahl et al.
suggest dividing the range of possible total count values into some number of partitions,
and to constrain all $\lambda_{w_{i-n+1}^{i-1}}$ associated with the same partition to have the same value.
This process of dividing n-grams up into partitions and training parameters independently
for each partition is referred to as bucketing.
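The recursive interpolation described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the method as trained in practice: a single fixed `lam` replaces the bucketed, Baum-Welch-trained interpolation weights, the recursion ends at the uniform 0th-order distribution, histories with no counts fall straight through to the lower-order model, and the toy corpus and all names are hypothetical.

```python
from collections import Counter

def train(sentences, max_n):
    """Count n-grams of order 1..max_n and the total count of each history."""
    counts, totals = Counter(), Counter()
    for words in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                counts[gram] += 1
                totals[gram[:-1]] += 1      # sum over w_i of c(history, w_i)
    return counts, totals

def p_interp(word, history, counts, totals, vocab_size, lam=0.5):
    """Interpolate the ML estimate for `history` with the next lower order."""
    if history:
        lower = p_interp(word, history[1:], counts, totals, vocab_size, lam)
    else:
        lower = 1.0 / vocab_size            # 0th-order uniform distribution
    total = totals[history]
    if total == 0:                          # assumption: unseen history, use lower order
        return lower
    p_ml = counts[history + (word,)] / total
    return lam * p_ml + (1.0 - lam) * lower

corpus = [["we", "burnish", "it"],
          ["the", "cat", "saw", "the", "dog"],
          ["thou", "art"]]
vocab = sorted({w for s in corpus for w in s})
counts, totals = train(corpus, max_n=2)

p_the = p_interp("the", ("burnish",), counts, totals, len(vocab))
p_thou = p_interp("thou", ("burnish",), counts, totals, len(vocab))
print(p_the > p_thou)                       # True
```

Both bigrams are unseen, yet the interpolated model prefers the more frequent word, and the probabilities over the vocabulary still sum to one for the history burnish.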

2.2.4 Katz Smoothing 

The other smoothing technique besides Jelinek-Mercer smoothing used widely in speech
recognition is due to Katz (1987). Katz smoothing (1987) extends the intuitions of
Good-Turing by adding the interpolation of higher-order models with lower-order models.

We first describe Katz smoothing for bigram models. In Katz smoothing, for every count
$r > 0$ a discount ratio $d_r$ is calculated, and any bigram with $r > 0$ counts is assigned a
corrected count of $d_r r$ counts. Then, to calculate a given conditional distribution $p(w_i \mid w_{i-1})$,
the nonzero counts are discounted according to $d_r$, and the counts subtracted from the
nonzero counts in that distribution are assigned to the bigrams with zero counts. These
counts assigned to the zero-count bigrams are distributed proportionally to the next lower-
order n-gram model, i.e., the unigram model.

In other words, if the original count of a bigram $c(w_{i-1}^{i})$ is $r$, we calculate its corrected
count as follows:

$$c_{katz}(w_{i-1}^{i}) = \begin{cases} d_r r & \text{if } r > 0 \\ \alpha\, p_{ML}(w_i) & \text{if } r = 0 \end{cases} \tag{2.10}$$

where $\alpha$ is chosen such that the total number of counts in the distribution $\sum_{w_i} c_{katz}(w_{i-1}^{i})$
is unchanged, i.e., $\sum_{w_i} c_{katz}(w_{i-1}^{i}) = \sum_{w_i} c(w_{i-1}^{i})$. To calculate $p_{katz}(w_i \mid w_{i-1})$ from the
corrected count, we just normalize:

$$p_{katz}(w_i \mid w_{i-1}) = \frac{c_{katz}(w_{i-1}^{i})}{\sum_{w_i} c_{katz}(w_{i-1}^{i})}$$

7 When the same data is used to estimate both, setting all $\lambda_{w_{i-n+1}^{i-1}}$ to one yields the optimal result.

The $d_r$ are calculated as follows: large counts are taken to be reliable, so they are not
discounted. In particular, Katz takes $d_r = 1$ for all $r > k$ for some $k$, where Katz suggests
$k = 5$. The discount ratios for the lower counts $r \le k$ are derived from the Good-Turing
estimate applied to the global bigram distribution; that is, the $n_r$ in equation (2.6) denote
the total numbers of bigrams that occur exactly $r$ times in the training data. These $d_r$ are
chosen in such a way that the resulting discounts are proportional to the discounts predicted
by the Good-Turing estimate, and such that the total number of counts discounted in the
global bigram distribution is equal to the total number of counts that should be assigned to
bigrams with zero counts according to the Good-Turing estimate.⁸ The former constraint
corresponds to the equation

$$1 - d_r = \mu \left(1 - \frac{r^*}{r}\right)$$

for all $1 \le r \le k$ for some constant $\mu$. Good-Turing estimates that the total number of
counts that should be assigned to bigrams with zero counts is $n_0 0^* = n_0 \frac{n_1}{n_0} = n_1$, so the
second constraint corresponds to the equation

$$\sum_{r=1}^{k} n_r (1 - d_r) r = n_1$$

The unique solution to these equations is given by

$$d_r = \frac{\frac{r^*}{r} - \frac{(k+1) n_{k+1}}{n_1}}{1 - \frac{(k+1) n_{k+1}}{n_1}}$$
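The closed-form discount ratios can be computed directly from the count-of-count statistics. The sketch below is a hedged illustration: the function name and the toy $n_r$ values are invented for this example, and the $n_r$ are used unsmoothed. It also checks numerically that the second constraint holds.

```python
def katz_discounts(n, k=5):
    """Discount ratios d_r for 1 <= r <= k.

    n maps r to n_r, the number of bigrams occurring exactly r times;
    it must contain the keys 1..k+1 with nonzero values.
    """
    common = (k + 1) * n[k + 1] / n[1]
    discounts = {}
    for r in range(1, k + 1):
        r_star = (r + 1) * n[r + 1] / n[r]          # Good-Turing, equation (2.6)
        discounts[r] = (r_star / r - common) / (1.0 - common)
    return discounts

# Hypothetical count-of-count values, for illustration only.
n = {1: 1000, 2: 400, 3: 200, 4: 120, 5: 80, 6: 50}
d = katz_discounts(n, k=5)
print({r: round(v, 3) for r, v in d.items()})

# Second constraint: the counts removed from r = 1..k total n_1.
removed = sum(n[r] * (1.0 - d[r]) * r for r in range(1, 6))
print(round(removed, 6))                            # 1000.0
```

With these toy values every $d_r$ lies strictly between 0 and 1, so each low count is reduced but never driven to zero.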



Katz smoothing for higher-order n-gram models is defined analogously. As we can see
in equation (2.10), the bigram model is defined in terms of the unigram model; in general,
the Katz n-gram model is defined in terms of the Katz (n-1)-gram model, similar to
Jelinek-Mercer smoothing. To end the recursion, the Katz unigram model is taken to be
the maximum likelihood unigram model:

$$p_{katz}(w_i) = p_{ML}(w_i) = \frac{c(w_i)}{\sum_{w_i} c(w_i)}$$



Recall that we mentioned in Section 2.2.2 that it is usually necessary to smooth $n_r$
when using the Good-Turing estimate, e.g., for those $n_r$ that are very low. However, in
Katz smoothing this is not essential because the Good-Turing estimate is only used for
small counts $r \le k$, and $n_r$ is generally fairly high for these values of $r$.

8 In the normal Good-Turing estimate, the number of counts discounted from n-grams with nonzero
counts happens to be equal to the number of counts assigned to n-grams with zero counts. Thus, the
normalization constant for a smoothed distribution is identical to that of the original distribution. In Katz
smoothing, Katz tries to achieve a similar effect except through discounting only counts $r \le k$.

            Jelinek-Mercer    Nadas    Katz
bigram           118           119      117
trigram           89            91       88

Table 2.1: Perplexity results reported by Katz and Nadas on 100 test sentences

Katz compares his algorithm with an unspecified version of Jelinek-Mercer deleted
estimation and with Nadas smoothing (Nadas, 1984) using 750,000 words of training data from
an office correspondence database. The perplexities displayed in Table 2.1 are reported
for a test set of 100 sentences. (Recall that smaller perplexities are desirable.) Katz
concludes that his algorithm performs at least as well as Jelinek-Mercer smoothing and Nadas
smoothing.



2.2.5 Church-Gale Smoothing 



Church and Gale (1991) describe a smoothing method that, like Katz's, combines the
Good-Turing estimate with a method for merging the information from lower-order models and
higher-order models.

We describe this method for bigram models. To motivate this method, consider using
the Good-Turing estimate directly to build a bigram distribution. For each bigram with
count $r$, we would assign a corrected count of $r^* = (r + 1) \frac{n_{r+1}}{n_r}$. As noted in Section 2.2.3,
this has the undesirable effect of giving all bigrams with zero count the same corrected
count; instead, unigram frequencies should be taken into account. Consider the corrected
count assigned by an interpolative model to a bigram $w_{i-1}^{i}$ with zero counts. In such a
model, we would have

$$p(w_i \mid w_{i-1}) \propto p(w_i)$$

for a bigram with zero counts. To convert this probability to a count, we multiply by the
total number of counts in the distribution to get

$$p(w_i \mid w_{i-1}) \sum_{w_i} c(w_{i-1}^{i}) \propto p(w_i) \sum_{w_i} c(w_{i-1}^{i}) = p(w_i)\, c(w_{i-1}) \propto p(w_i)\, p(w_{i-1})$$

Thus, $p(w_{i-1}) p(w_i)$ may be a good indicator of the corrected count of a bigram $w_{i-1}^{i}$ with
zero counts.

test set size (words)    Jelinek-Mercer (3 λ's / 15 λ's / 150 λ's)    MacKay-Peto
260,000                  79.60                                        79.90
243,000                  89.57 / 88.47 / 88.91                        89.06
116,000                  91.82                                        92.28

Table 2.2: Perplexity results reported by MacKay and Peto on three test sets

In Church-Gale smoothing, bigrams $w_{i-1}^{i}$ are partitioned or bucketed according to the
value of $p_{ML}(w_{i-1}) p_{ML}(w_i)$. That is, they divide the range of possible $p_{ML}(w_{i-1}) p_{ML}(w_i)$
values into a number of partitions, and all bigrams associated with the same subrange are
considered to be in the same bucket. Then, each bucket is treated as a distinct probability
distribution and Good-Turing estimation is performed within each. For a bigram in bucket
$b$ with $r_b$ counts, we calculate its corrected count $r_b^*$ as

$$r_b^* = (r_b + 1) \frac{n_{b,r+1}}{n_{b,r}}$$

where the counts $n_{b,r}$ include only those bigrams within bucket $b$.

Church and Gale partition the range of possible $p_{ML}(w_{i-1}) p_{ML}(w_i)$ values into about 35
buckets, with three buckets in each factor of 10. To smooth the $n_{b,r}$ for the Good-Turing
estimate, they use a smoother by Shirey and Hastie (1988).
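The bucketing step of Church-Gale smoothing can be sketched as below. This is only an approximation of the idea under several assumptions: bucket boundaries are placed at powers of $10^{1/3}$ (one reading of "three buckets in each factor of 10"), the $n_{b,r}$ are not smoothed, only observed bigrams are handled (zero-count bigrams, which also belong to buckets, are ignored), and all names and probabilities are hypothetical.

```python
import math
from collections import Counter, defaultdict

BUCKETS_PER_DECADE = 3   # assumed reading of "three buckets in each factor of 10"

def bucket_of(p_prev, p_word):
    """Index of the bucket containing p_ML(w_{i-1}) * p_ML(w_i)."""
    return math.floor(BUCKETS_PER_DECADE * math.log10(p_prev * p_word))

def bucketed_good_turing(bigram_counts, unigram_probs):
    """Map observed bigrams to corrected counts computed within their bucket."""
    n = defaultdict(Counter)                         # n[b][r] = n_{b,r}
    for (prev, word), r in bigram_counts.items():
        n[bucket_of(unigram_probs[prev], unigram_probs[word])][r] += 1
    corrected = {}
    for (prev, word), r in bigram_counts.items():
        b = bucket_of(unigram_probs[prev], unigram_probs[word])
        if n[b][r + 1] > 0:                          # skip where n_{b,r+1} is zero
            corrected[(prev, word)] = (r + 1) * n[b][r + 1] / n[b][r]
    return corrected

# Hypothetical unigram probabilities and bigram counts.
probs = {"the": 0.06, "of": 0.04, "cat": 0.002, "dog": 0.002, "cow": 0.002}
bigrams = {("the", "cat"): 1, ("the", "dog"): 1, ("the", "cow"): 2, ("of", "cat"): 1}
print(bucket_of(probs["the"], probs["cat"]), bucket_of(probs["of"], probs["cat"]))
print(bucketed_good_turing(bigrams, probs))
```

In the toy data the three bigrams starting with the share one bucket, so their count-of-count statistics are pooled, while the bigram starting with of falls in a neighbouring bucket and is treated separately.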

While extensive empirical analysis is reported, they present only a single entropy result, 
comparing the above smoothing technique with another smoothing method introduced in 
their paper, extended deleted estimation. 



2.2.6 Bayesian Smoothing 

Several smoothing techniques are motivated within a Bayesian framework. A prior
distribution over smoothed distributions is selected, and this prior is used to somehow arrive at
a final smoothed distribution. For example, Nadas (1984) selects smoothed probabilities to
be their mean a posteriori value given the prior distribution.

Nadas (1984) hypothesizes a prior distribution from the family of beta functions. The
reported experimental results are presented in Table 2.1. (The same results are reported in
the Katz and Nadas papers.) These results indicate that Nadas smoothing performs slightly
worse than Katz and Jelinek-Mercer smoothing.

MacKay and Peto (1995) use Dirichlet priors in an attempt to motivate the linear
interpolation used in Jelinek-Mercer smoothing. They compare their method with
Jelinek-Mercer smoothing for a single training set of about two million words. For Jelinek-Mercer
smoothing, deleted interpolation was used dividing the corpus up into six sections. The
parameters $\lambda$ were bucketed as suggested by Bahl et al. (1983), and three different
bucketing granularities were tried. They report results for three different test sets; these results
are displayed in Table 2.2. These results indicate that MacKay-Peto smoothing performs
slightly worse than Jelinek-Mercer smoothing.



2.3 Novel Smoothing Techniques 

Of the great many novel methods that we have tried, two techniques have performed
especially well.

Figure 2.1: λ values for old and new bucketing schemes for Jelinek-Mercer smoothing; each
point represents a single bucket



2.3.1 Method average-count 

This scheme is an instance of Jelinek-Mercer smoothing. Recall that one takes

$$p_{interp}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}}\, p_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{interp}(w_i \mid w_{i-n+2}^{i-1}),$$

where Bahl et al. suggest that the $\lambda_{w_{i-n+1}^{i-1}}$ are bucketed according to $\sum_{w_i} c(w_{i-n+1}^{i})$, the
total number of counts in the higher-order distribution. We have found that partitioning
the $\lambda_{w_{i-n+1}^{i-1}}$ according to the average number of counts per nonzero element

$$\frac{\sum_{w_i} c(w_{i-n+1}^{i})}{|w_i : c(w_{i-n+1}^{i}) > 0|}$$

yields better results.

Intuitively, the less sparse the data for estimating $p_{ML}(w_i \mid w_{i-n+1}^{i-1})$, the larger $\lambda_{w_{i-n+1}^{i-1}}$
should be. While the larger the total number of counts in a distribution the less sparse the
distribution tends to be, this measure ignores the allocation of counts between words. For
example, we would consider a distribution with ten counts distributed evenly among ten
words to be much more sparse than a distribution with ten counts all on a single word. The
average number of counts per word seems to more directly express the concept of sparseness.
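The difference between the two bucketing criteria is easy to see on the example just given. The sketch below, with hypothetical names, computes both the total count and the average count per nonzero element for a distribution of next-word counts.

```python
def bucketing_values(next_word_counts):
    """Return (total count, average count per nonzero element) for one history."""
    total = sum(next_word_counts.values())
    nonzero = sum(1 for c in next_word_counts.values() if c > 0)
    average = total / nonzero if nonzero else 0.0
    return total, average

even = {f"w{i}": 1 for i in range(10)}   # ten counts spread over ten words
peaked = {"w0": 10}                      # ten counts on a single word

print(bucketing_values(even))            # (10, 1.0)
print(bucketing_values(peaked))          # (10, 10.0)
```

Both distributions have the same total count and so would share a bucket under the original scheme, whereas the average-count criterion separates the sparse distribution from the concentrated one.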



In Figure 2.1, we graph the value of $\lambda$ assigned to each bucket under the original and
new bucketing schemes on identical data. The x-axis in each graph represents the criterion
used for bucketing. Notice that the new bucketing scheme results in a much tighter plot,
indicating that it is better at grouping together distributions with similar behavior.

One can use the Good-Turing estimate to partially explain this behavior. As mentioned
in Section 2.2.4, the Good-Turing estimate states that the number of counts that should
be devoted to n-grams with zero counts is $n_1$, the number of n-grams in the distribution
with exactly one count. This is equivalent to assigning a total probability of $\frac{n_1}{N}$ to n-grams
with zero counts, where $N$ is the total number of counts in the distribution. Notice that
the value $1 - \lambda_{w_{i-n+1}^{i-1}}$ in Jelinek-Mercer smoothing is roughly proportional to the total
probability assigned to n-grams with zero counts: for n-grams $w_{i-n+1}^{i}$ with zero count we
have $p_{ML}(w_i \mid w_{i-n+1}^{i-1}) = 0$ so

$$p_{interp}(w_i \mid w_{i-n+1}^{i-1}) = (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{interp}(w_i \mid w_{i-n+2}^{i-1})$$

Thus, it seems reasonable that we want to satisfy the relation

$$1 - \lambda_{w_{i-n+1}^{i-1}} \propto \frac{n_1}{N}$$

where in this case $n_1 = |w_i : c(w_{i-n+1}^{i}) = 1|$ and $N = \sum_{w_i} c(w_{i-n+1}^{i})$.⁹

Our goal in choosing a bucketing scheme is to bucket n-grams that should have similar
$\lambda$ values. Hence, given the above analysis we should bucket n-grams $w_{i-n+1}^{i-1}$ according to
the value of

$$\frac{n_1}{N} = \frac{|w_i : c(w_{i-n+1}^{i}) = 1|}{\sum_{w_i} c(w_{i-n+1}^{i})} \tag{2.11}$$

This is very similar to the inverse of

$$\frac{\sum_{w_i} c(w_{i-n+1}^{i})}{|w_i : c(w_{i-n+1}^{i}) > 0|},$$

the actual value we use to bucket with. Instead of looking at the number of n-grams with
exactly one count, we use the number of n-grams with nonzero counts. Notice that it does
not matter much whether we bucket according to a value or according to its inverse; the
same n-grams are grouped together.

A natural experiment to try is to bucket according to the expression given in
equation (2.11). However, using this expression and variations, we were unable to surpass the
performance of the given bucketing scheme.

2.3.2 Method one-count 

This technique combines two intuitions. First, MacKay and Peto (1995) show that by using
a Dirichlet prior as a prior distribution over possible smoothed distributions, we get (roughly
speaking) a model of the form

$$p_{one}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) + \alpha\, p_{one}(w_i \mid w_{i-n+2}^{i-1})}{\sum_{w_i} c(w_{i-n+1}^{i}) + \alpha}$$

where $\alpha$ is constant across n-grams. This is similar to additive smoothing, except that
instead of adding the same number of counts to each n-gram, we add counts proportional
to the probability yielded by the next lower-order distribution. The parameter $\alpha$ represents
the total number of counts that we add to the distribution.

9 Notice that $n_1$ and $N$ have different meanings in this context from those found in Katz smoothing and
Church-Gale smoothing. While in each of these cases we use the Good-Turing estimate, we apply the estimate
to different distributions. In Katz, we apply the Good-Turing estimate to the global n-gram distribution,
so that $n_1$ represents the total number of n-grams with exactly one count and $N$ represents the total count
of n-grams. Church-Gale is similar to Katz except that n-grams are partitioned into a number of buckets.
However, in this context we apply the Good-Turing estimate to a conditional distribution $p(w_i \mid w_{i-n+1}^{i-1})$ for
some fixed $w_{i-n+1}^{i-1}$. Thus, $n_1$ represents the number of n-grams $w_{i-n+1}^{i}$ beginning with $w_{i-n+1}^{i-1}$ with exactly
one count, and $N$ represents the total count of n-grams $w_{i-n+1}^{i}$ beginning with $w_{i-n+1}^{i-1}$.

Secondly, using a similar analysis as in the last section, the Good-Turing estimate can
be interpreted as stating that the number of extra counts $\alpha$ should be proportional to $n_1$,
the number of n-grams with exactly one count in the given distribution. Thus, instead of
taking $\alpha$ to be constant across n-grams $w_{i-n+1}^{i}$, we take it to be a function of
$n_1 = |w_i : c(w_{i-n+1}^{i}) = 1|$. We have found that taking

$$\alpha = \gamma (n_1 + \beta) \tag{2.12}$$

works well, where $\beta$ and $\gamma$ are constants. Notice that higher-order models are defined
recursively in terms of lower-order models.

Given the results mentioned in the last section, a natural experiment to try is to take
$\alpha$ to be a function of the number of nonzero counts in the distribution, as opposed to the
number of one counts. However, attempts in this vein failed to yield superior results.
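A compact sketch of the one-count model is given below. It is an illustration under assumptions rather than the thesis code: the recursion is ended with a uniform 0th-order distribution, the values `beta=1.0` and `gamma=1.0` are arbitrary placeholders for the tuned constants, and the toy corpus and names are hypothetical.

```python
from collections import Counter, defaultdict

def train(sentences, max_n):
    """Map each history (tuple of words, possibly empty) to next-word counts."""
    followers = defaultdict(Counter)
    for words in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                history = tuple(words[i:i + n - 1])
                followers[history][words[i + n - 1]] += 1
    return followers

def p_one(word, history, followers, vocab_size, beta=1.0, gamma=1.0):
    """One-count smoothing with alpha = gamma * (n_1 + beta), equation (2.12)."""
    if history:
        lower = p_one(word, history[1:], followers, vocab_size, beta, gamma)
    else:
        lower = 1.0 / vocab_size                  # assumed end of the recursion
    dist = followers.get(history, {})
    total = sum(dist.values())
    n1 = sum(1 for c in dist.values() if c == 1)  # words seen exactly once here
    alpha = gamma * (n1 + beta)
    return (dist.get(word, 0) + alpha * lower) / (total + alpha)

corpus = [["we", "burnish", "it"],
          ["the", "cat", "saw", "the", "dog"],
          ["thou", "art"]]
vocab = sorted({w for s in corpus for w in s})
followers = train(corpus, max_n=2)

p_the = p_one("the", ("burnish",), followers, len(vocab))
p_thou = p_one("thou", ("burnish",), followers, len(vocab))
print(p_the, p_thou)
```

Because the added counts are spread in proportion to the lower-order distribution, the unseen bigram ending in the more frequent word receives the larger share, and the smoothed distribution still sums to one.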

2.4 Experimental Methodology 

In our experiments, we compare our novel smoothing methods with the most widely-used
smoothing techniques in language modeling: additive smoothing, Jelinek-Mercer smoothing,
and Katz smoothing. For Jelinek-Mercer smoothing, we try both held-out interpolation
and deleted interpolation. In addition, we have also implemented Church-Gale smoothing,
as this has never been compared against popular techniques. We do not consider Nadas
smoothing or MacKay-Peto smoothing as they are not widely used and as previous results
indicate that they do not perform as well as other methods.

As a baseline method, we choose a simple instance of Jelinek-Mercer smoothing, one
that uses far fewer parameters than are typically used in real applications.

2.4.1 Smoothing Implementations 

In this section, we discuss the details of our implementations of various smoothing tech- 
niques. The titles of the following sections include the mnemonic we use to refer to the 
implementations in later sections. We use the mnemonic when we are referring to our spe- 
cific implementation of a smoothing method, as opposed to the algorithm in general. For 
each method, we mention the parameters that can be tuned to optimize performance; in 
general, any variable mentioned is a tunable parameter. 

To give an informal estimate of the difficulty of implementation of each method, in Table
2.3 we display the number of lines of C++ code in each implementation excluding the core
code common across techniques.

Method               Lines
plus-one                40
plus-delta              40
katz                   300
church-gale           1000
interp-held-out        400
interp-del-int         400
new-avg-count          400
new-one-count           50
interp-baseline¹⁰      400

Table 2.3: Implementation difficulty of various methods in terms of lines of C++ code

10 For interp-baseline, we used the interp-held-out code as it is just a special case. Written anew, it
probably would have been about 50 lines.



Additive Smoothing (plus-one, plus-delta) 

We consider two versions of additive smoothing. Referring to equation (2.5) in Section
2.2.1, we fix $\delta = 1$ in plus-one smoothing. In plus-delta, we consider any $\delta$. (The values
of parameters such as $\delta$ are determined through training on held-out data.)



Jelinek-Mercer Smoothing (interp-held-out, interp-del-int) 

Recall that higher-order models are defined recursively in terms of lower-order models. 
We end the recursion by taking the 0th-order distribution to be the uniform distribution 
$p_{\text{unif}}(w_i) = 1/|V|$. 

We bucket the $\lambda_{w_{i-n+1}^{i-1}}$ according to $\sum_{w_i} c(w_{i-n+1}^{i})$ as suggested by Bahl et al. 
Intuitively, each bucket should be made as small as possible, to only group together the most 
similar n-grams, while remaining large enough to accurately estimate the associated 
parameters. We make the assumption that whether a bucket is large enough for accurate 
parameter estimation depends on how many n-grams that fall in that bucket occur in the 
data used to train the $\lambda$'s. We bucket in such a way that a minimum of $c_{\min}$ n-grams fall 
in each bucket. We start from the lowest possible value of $\sum_{w_i} c(w_{i-n+1}^{i})$ (i.e., zero) and 
put increasing values of $\sum_{w_i} c(w_{i-n+1}^{i})$ into the same bucket until this minimum count is 
reached. We repeat this process until all possible values of $\sum_{w_i} c(w_{i-n+1}^{i})$ are bucketed. If 
the last bucket has fewer than $c_{\min}$ counts, we merge it with the preceding bucket. 
Historically, this process is called the wall of bricks (Magerman, 1994). We use separate buckets 
for each n-gram model being interpolated. 
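
The grouping procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the thesis code: the function name `wall_of_bricks` and the input format (a mapping from each bucketing value to the number of n-grams having that value) are hypothetical.

```python
def wall_of_bricks(ngrams_per_value, c_min):
    """Group bucketing values, in increasing order, so that every bucket
    contains at least c_min n-grams.

    ngrams_per_value maps a bucketing value (e.g. the total count of a
    history) to the number of n-grams with that value.
    Returns a list of buckets, each a list of bucketing values.
    """
    buckets, current, size = [], [], 0
    for value in sorted(ngrams_per_value):
        current.append(value)
        size += ngrams_per_value[value]
        if size >= c_min:          # bucket is full: close it
            buckets.append(current)
            current, size = [], 0
    if current:                    # leftover bucket is too small
        if buckets:
            buckets[-1].extend(current)   # merge with the preceding bucket
        else:
            buckets.append(current)       # only one bucket exists at all
    return buckets

buckets = wall_of_bricks({0: 5, 1: 3, 2: 2, 3: 4, 7: 1}, c_min=5)
merged_tail = wall_of_bricks({0: 5, 1: 2}, c_min=5)
```

Setting `c_min` very large collapses everything into a single bucket, which is the situation the baseline method corresponds to.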

In performing this bucketing, we create an array containing how many n-grams occur 
for each value of $\sum_{w_i} c(w_{i-n+1}^{i})$ up to some maximum value of $\sum_{w_i} c(w_{i-n+1}^{i})$, which we 
call $c_{\text{top}}$. For n-grams $w_{i-n+1}^{i}$ with $\sum_{w_i} c(w_{i-n+1}^{i}) > c_{\text{top}}$, we pretend $\sum_{w_i} c(w_{i-n+1}^{i}) = c_{\text{top}}$ 
for bucketing purposes. 

As mentioned in Section 2.2.3, the $\lambda$'s can be trained efficiently using the Baum-Welch 
algorithm. Given initial values for the $\lambda$'s, the Baum-Welch algorithm adjusts these 
parameters iteratively to minimize the entropy of some data. The algorithm generally decreases 
the entropy with each iteration, and guarantees not to increase it. We set all $\lambda$'s initially 
to the value $\lambda_0$. We terminate the algorithm when the entropy per word changes less than 
$\delta_{\text{stop}}$ bits between iterations. 
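
As a minimal sketch of this training loop, the Python function below runs the Baum-Welch (EM) update for a single interpolation weight shared by all held-out words, which corresponds to one bucket at one n-gram level. The function name `train_lambda` and its input format are assumptions made for illustration; the lower-order probabilities are assumed to be strictly positive, as they are when the recursion ends in a uniform distribution.

```python
import math

def train_lambda(p_high, p_low, lambda0=0.5, delta_stop=0.001, max_iters=1000):
    """Train one interpolation weight on held-out data.

    p_high[i], p_low[i]: probabilities that the higher-order and the
    (already smoothed) lower-order model assign to the i-th held-out word.
    Returns (lambda, entropy in bits per word).
    """
    def entropy(lam):
        return -sum(math.log2(lam * h + (1 - lam) * l)
                    for h, l in zip(p_high, p_low)) / len(p_high)

    lam, prev = lambda0, None
    for _ in range(max_iters):
        cur = entropy(lam)
        # Stop once the per-word entropy changes by less than delta_stop bits.
        if prev is not None and prev - cur < delta_stop:
            break
        prev = cur
        # E-step: posterior that each word was generated by the higher-order
        # model. M-step: the new weight is the average posterior.
        lam = sum(lam * h / (lam * h + (1 - lam) * l)
                  for h, l in zip(p_high, p_low)) / len(p_high)
    return lam, entropy(lam)

p_high = [0.5, 0.0, 0.4, 0.3]
p_low = [0.1, 0.1, 0.1, 0.1]
lam, bits = train_lambda(p_high, p_low)
```

Each update cannot increase the held-out entropy, which is why the difference `prev - cur` is never negative in exact arithmetic.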

We implemented two versions of Jelinek-Mercer smoothing, one using held-out 
interpolation and one using deleted interpolation. In interp-held-out, the $\lambda$'s are trained using 
held-out interpolation on one of the development test sets. In interp-del-int, the $\lambda$'s are 
trained using the relaxed deleted interpolation technique described by Jelinek and Mercer, 
where one word is deleted at a time. In interp-del-int, we bucket an n-gram according 
to its count before deletion, as this turned out to significantly improve performance. We 
hypothesize that this is because this causes an n-gram to be placed in the same bucket 
during training as in evaluation, allowing the $\lambda$'s to be meaningfully geared toward individual 
n-grams. 



Katz Smoothing (katz) 



Referring to Section 2.2.4, instead of a single $k$ we allow a different $k_n$ for each n-gram 
model being interpolated. 

Recall that higher-order models are defined recursively in terms of lower-order models, 
and that the recursion is ended by taking the unigram distribution to be the maximum 
likelihood distribution. While using the maximum likelihood unigram distribution works 
well in practice, this choice is not well-suited to our work. In practice, the vocabulary 
V is usually chosen to include only those words that occur in the training data, so that 
$p_{\text{ML}}(w_i) > 0$ for all $w_i \in V$. This assures that the probabilities of all n-grams are nonzero. 
However, in this work we do not satisfy the constraint that all words in the vocabulary 
occur in the training data. We run experiments using many training set sizes, and we use 
a fixed vocabulary across all runs so that results between sizes are comparable. Not all 
words in the vocabulary will occur in the smaller training sets. Thus, unless we smooth 
the unigram distribution we may have n-gram probabilities that are zero, which could lead 
to an infinite cross-entropy on test data. To address this issue, we smooth the unigram 
distribution in Katz smoothing using additive smoothing; we call the additive constant $\delta$.$^{11}$ 
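
The effect of a fixed vocabulary on an unsmoothed unigram model can be seen in a small Python sketch. The function name `unigram_cross_entropy` and the toy data are hypothetical; the additive form used here is the same assumption as for the additive methods above.

```python
import math
from collections import Counter

def unigram_cross_entropy(train, test, vocab, delta=0.0):
    """Cross-entropy (bits per word) of an additively smoothed unigram
    model; delta = 0 gives the maximum likelihood distribution."""
    counts = Counter(train)
    denom = len(train) + delta * len(vocab)
    bits = 0.0
    for w in test:
        p = (counts[w] + delta) / denom
        if p == 0.0:
            return math.inf      # a zero probability makes the entropy infinite
        bits -= math.log2(p)
    return bits / len(test)

vocab = ["a", "b", "c"]          # fixed vocabulary; "c" never occurs in training
train = ["a", "a", "b"]
test = ["a", "c"]

h_ml = unigram_cross_entropy(train, test, vocab)                  # infinite
h_smoothed = unigram_cross_entropy(train, test, vocab, delta=0.5) # finite
```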

In the algorithm as described in the original paper, no probability is assigned to n-grams 
with zero counts in a conditional distribution $p(w_i | w_{i-n+1}^{i-1})$ if there are no n-grams $w_{i-n+1}^{i}$ 
that occur between 1 and $k_n$ times in that distribution. This can lead to an infinite 
cross-entropy on test data. To address this, whenever there are no counts between 1 and $k_n$ in 
a conditional distribution, we give the zero-count n-grams a total of $\beta$ counts, and increase 
the normalization constant appropriately. 



11 In Jelinek-Mercer smoothing, we address this issue by ending the model recursion with a 0th-order model 
instead of a unigram model, and taking the 0th-order model to be a uniform distribution. We tried a similar 
tack with Katz smoothing, but the natural way of interpolating a unigram model with a uniform model 
in the Katzian paradigm led to poor results. We tried additive smoothing instead, which is equivalent to 
interpolating with a uniform distribution using the Jelinek-Mercer paradigm, and this worked well. 
Church-Gale Smoothing (church-gale) 

While Church and Gale use the maximum likelihood unigram distribution, we instead 
smooth the unigram distribution using Good-Turing (without bucketing) as this seems 
more consistent with the spirit of the algorithm. This should not affect performance much, 
as the unigram probabilities are used only for bucketing purposes.$^{12}$ 

We use a different bucketing scheme than that described by Church and Gale. For 
a bigram model, they divide the range of possible values of $p(w_{i-1})p(w_i)$ into about 35 
buckets, with three buckets per factor of 10. However, this bucketing strategy is not ideal 
as bigrams are not distributed uniformly among different orders of magnitude. Furthermore, 
they provide analysis that indicates that they had sufficient data to distinguish between at 
least 1200 different probabilities to be assigned to bigrams with zero counts; this is evidence 
that using significantly more than 35 buckets might yield better performance. Hence, we 
chose to use wall of bricks bucketing as in our implementation of Jelinek-Mercer smoothing. 

We first do as Church and Gale do and partition the range of possible $p(w_{i-1})p(w_i)$ 
values using some constant number of buckets per order of magnitude, except instead of 
using a total of 35 buckets we use some very large number of buckets, $c_{mb}$. Instead of calling 
these partitions buckets we call them minibuckets, as we lump together these minibuckets 
to form our final buckets using the wall of bricks technique. We group together minibuckets 
so that at least $c_{\min}$ n-grams with nonzero count fall in each bucket. 
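
The first stage of this scheme, assigning bigrams to log-scale minibuckets and counting the nonzero-count bigrams in each, can be sketched in Python as follows. The function names, the parameter `per_decade` (minibuckets per order of magnitude), and the toy probabilities are assumptions for illustration only; the resulting tallies are the quantities that the wall of bricks grouping would then consume.

```python
import math
from collections import Counter

def minibucket(p_joint, per_decade):
    """Index of the log-scale minibucket containing
    p_joint = p(w_{i-1}) * p(w_i), with 0 < p_joint <= 1."""
    return math.floor(per_decade * math.log10(p_joint))

def count_nonzero_per_minibucket(bigram_counts, unigram_prob, per_decade):
    """Number of bigrams with nonzero count falling in each minibucket."""
    tally = Counter()
    for (prev, word), c in bigram_counts.items():
        if c > 0:
            p_joint = unigram_prob[prev] * unigram_prob[word]
            tally[minibucket(p_joint, per_decade)] += 1
    return tally

# Toy unigram probabilities (not normalized; for illustration only).
unigram_prob = {"the": 0.5, "cat": 0.04, "sat": 0.01}
bigram_counts = {("the", "cat"): 3, ("cat", "sat"): 1, ("sat", "the"): 0}
tally = count_nonzero_per_minibucket(bigram_counts, unigram_prob, per_decade=3)
```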

To smooth the counts $n_r$ needed for the Good-Turing estimate, we use the technique 
described by Gale and Sampson (1995). This technique assigns a total probability of $n_1/N$ 
to n-grams with zero counts, as dictated by the Good-Turing estimate. However, it is 
possible that $n_1 = N$, in which case no probability is assigned to nonzero counts. As this is 
unacceptable, we modify the algorithm so that in this case, we assign a total probability of 
$p_{n_1=N} < 1$ to zero counts. In addition, it is possible that $n_1 = 0$, in which case no probability 
is assigned to zero counts. In this case, we instead assign a total probability of $p_{n_1=0} > 0$ 
to zero counts. 
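
The two special cases amount to a small guard around the Good-Turing zero-count mass. The Python sketch below is an illustration only; the function name `zero_count_mass` is hypothetical, and the default values are the settings reported for these two parameters in Section 2.4.4.

```python
def zero_count_mass(n1, total, p_n1_eq_n=0.995, p_n1_eq_0=0.01):
    """Total probability given to zero-count n-grams in one distribution.

    n1: number of n-grams occurring exactly once; total: total count N.
    """
    if total <= 0:
        raise ValueError("distribution has no counts")
    if n1 == total:      # all counts are one-counts: keep some mass for them
        return p_n1_eq_n
    if n1 == 0:          # no one-counts: still reserve some mass for zeros
        return p_n1_eq_0
    return n1 / total    # ordinary Good-Turing estimate

usual = zero_count_mass(3, 12)
all_ones = zero_count_mass(5, 5)
no_ones = zero_count_mass(0, 7)
```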

Finally, the original paper describes only bigram smoothing in detail; extending this 
method to trigram models is ambiguous. In particular, it is unclear whether to bucket 
trigrams according to $p(w_{i-2}^{i-1})p(w_i)$ or $p(w_{i-2}^{i-1})p(w_i | w_{i-1})$. We choose the former value; 
while the latter value may yield better performance as it is a better estimate of $p(w_{i-2}^{i})$,$^{13}$ 
our belief is that it is much more difficult to implement and it requires a great deal more 
computation. 



We outline the algorithm we use to construct a trigram model in Figure 2.2. The 
time complexity of the algorithm is roughly $O(c_{nz} + c_p)$, where $c_{nz}$ denotes the number 



12 We observed an interesting phenomenon when smoothing the unigram distribution with Good-Turing. 
We construct a vocabulary $V$ by collecting all words in a corpus that occur at least $k$ times, for some value 
$k$. For training sets that include the majority of a corpus, there will be unnaturally few words occurring 
fewer than $k$ times, since most of these words have been weeded out of the vocabulary. This results in $n_r$ 
that can yield odd corrected counts $r^*$. For example, for a given cutoff $k$ we may get $n_k \gg n_{k-1}$ so that 
$(k-1)^*$ is overly high. We have not found that this phenomenon significantly affects performance. 



13 Referring to the analysis given in Section 2.2.5, the former choice roughly corresponds to interpolating 
the trigram model with a unigram model, while the latter choice corresponds to interpolating the trigram 
model with a bigram model. 
; count how many trigrams with nonzero counts fall in each minibucket 
for each trigram $w_{i-2}^{i}$ with $c(w_{i-2}^{i}) > 0$ do 
    increment the count for the minibucket $b_m$ that $w_{i-2}^{i}$ falls in 

group minibuckets $b_m$ into buckets $b$ using the wall of bricks technique 

; calculate number of trigrams in each bucket by looping over all values of $p(w_{i-2}^{i-1})p(w_i)$ 
; this is used later to calculate number of trigrams with zero counts in each bucket 
; the first two loops loop over all possible values of $p(w_{i-2}^{i-1})$, the third loop is for $p(w_i)$ 
for each bigram bucket $b_2$ do 
    for each count $r_2$ with $n_{b_2,r_2} > 0$ do 
        for each count $r_1$ with $n_{r_1} > 0$ in the unigram distribution do 
        begin 
            $b$ := the bucket a trigram $w_{i-2}^{i}$ falls in if $c(w_{i-2}^{i-1}) = r_2$ and $c(w_i) = r_1$ 
            increment the count for $b$ by the number of trigrams $w_{i-2}^{i}$ such that 
                $c(w_{i-2}^{i-1}) = r_2$ and $c(w_i) = r_1$, i.e., $|\{w_{i-2}^{i-1} : c(w_{i-2}^{i-1}) = r_2\}| \times |\{w_i : c(w_i) = r_1\}|$ 
        end 

; calculate counts $n_{b,r}$ for each bucket $b$ and count $r > 0$ 
for each trigram $w_{i-2}^{i}$ with $c(w_{i-2}^{i}) > 0$ do 
    calculate the bucket $b$ that $w_{i-2}^{i}$ falls in, and increase $n_{b,r}$ for $r = c(w_{i-2}^{i})$ 

calculate $n_{b,0}$ by subtracting $\sum_{r=1}^{\infty} n_{b,r}$ from the total number of trigrams in $b$ 

smooth the $n_{b,r}$ values using the Gale-Sampson algorithm 

; calculate normalization constants $\sum_{w_i} c_{GT}(w_{i-2}^{i})$ for each $w_{i-2}^{i-1}$ 
for each bigram bucket $b_2$ do 
    for each count $r_2$ with $n_{b_2,r_2} > 0$ do 
        calculate normalization constant $N_{b_2,r_2}$ for a bigram $w_{i-2}^{i-1}$ in bucket $b_2$ 
            with $c(w_{i-2}^{i-1}) = r_2$ given that $c(w_{i-2}^{i}) = 0$ for all $w_i$ 
for each bigram $w_{i-2}^{i-1}$ with $c(w_{i-2}^{i-1}) > 0$ do 
    calculate its normalization constant $\sum_{w_i} c_{GT}(w_{i-2}^{i})$ by calculating its difference 
        from $N_{b_2,r_2}$ where $w_{i-2}^{i-1}$ falls in bucket $b_2$ and $c(w_{i-2}^{i-1}) = r_2$; this can be done 
        by looping through all trigrams $w_{i-2}^{i}$ with $c(w_{i-2}^{i}) > 0$ 

Figure 2.2: Outline of our Church-Gale trigram implementation 
of trigrams with nonzero counts and $c_p$ denotes the number of different possible values of 
$p(w_{i-2}^{i-1})p(w_i)$. The term $c_p$ comes from the fact that to calculate the total number of 
trigrams in a bucket $b$ (which is needed to efficiently calculate $n_{b,0}$), it is necessary to loop 
over all possible values of $p(w_{i-2}^{i-1})p(w_i)$. We take advantage of the fact that the number 
of possible values for $p(w_{i-2}^{i-1})$ is at most the total number of different bigram counts in 
each bigram bucket, $|\{n_{b,r} : n_{b,r} > 0, b \text{ a bigram bucket}\}|$, as each $(b, r)$ pair corresponds to 
a potentially different corrected count that can be assigned to a bigram. Similarly, the 
number of possible values for $p(w_i)$ is at most the total number of different unigram counts, 
$|\{n_r : n_r > 0\}|$ for the unigram distribution. 

Now, consider the analogous algorithm except bucketing using $p(w_{i-2}^{i-1})p(w_i | w_{i-1})$. The 
factor in $c_p$ from $p(w_{i-2}^{i-1})$ remains the same, but the number of different values for $p(w_i | w_{i-1})$ 
is much larger than the number of different values for $p(w_i)$. The number of different values 
for $p(w_i | w_{i-1})$ is roughly equal to the number of different bigrams with nonzero counts, 
while the number of different values for $p(w_i)$ is at most the number of different unigram 
counts. Thus, this alternate bucketing scheme is much more expensive computationally. 



Novel Smoothing Methods (new-avg-count, new-one-count) 



The implementation of smoothing method average-count, new-avg-count, is identical to 
interp-held-out except that instead of bucketing the $\lambda_{w_{i-n+1}^{i-1}}$ according to $\sum_{w_i} c(w_{i-n+1}^{i})$, 
we bucket according to 

$$\frac{\sum_{w_i} c(w_{i-n+1}^{i})}{|\{w_i : c(w_{i-n+1}^{i}) > 0\}|}$$

as described in Section 2.3.1. 
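
The difference between the two bucketing keys is easy to see on toy data. The Python sketch below is illustrative only; the function name `bucketing_keys` is hypothetical, and returning 0.0 for a history with no counts is an assumption made to keep the function total.

```python
def bucketing_keys(successor_counts):
    """Return (total count, average count per distinct successor) for one
    history, given the counts c(w_{i-n+1}^i) of each word following it."""
    total = sum(successor_counts)
    distinct = sum(1 for c in successor_counts if c > 0)
    average = total / distinct if distinct else 0.0   # assumption for empty histories
    return total, average

# Two histories with the same total count but different average counts:
# interp-held-out would bucket them together, new-avg-count would not.
peaked = bucketing_keys([5, 1, 0, 0])
flat = bucketing_keys([2, 2, 2, 0])
```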

In the implementation of smoothing method one-count, new-one-count, we have 
different parameters $\beta_n$ and $\gamma_n$ in equation (2.12) for each n-gram model being interpolated. 
Also, recall that higher-order models are defined recursively in terms of lower-order models. 
We end the recursion by taking the 0th-order distribution to be the uniform distribution 
$p_{\text{unif}}(w_i) = 1/|V|$. 



Baseline Smoothing (interp-baseline) 

For our baseline smoothing method, we use Jelinek-Mercer smoothing with held-out 
interpolation where for each n-gram model being interpolated we constrain all $\lambda_{w_{i-n+1}^{i-1}}$ in the 
model to be equal to a single value $\lambda_n$, i.e., 

$$p_{\text{base}}(w_i | w_{i-n+1}^{i-1}) = \lambda_n \, p_{\text{ML}}(w_i | w_{i-n+1}^{i-1}) + (1 - \lambda_n) \, p_{\text{base}}(w_i | w_{i-n+2}^{i-1}).$$

This is identical to interp-held-out where $c_{\min}$ is set to $\infty$, so that there is only a single 
bucket for each n-gram model. 
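
The recursion can be written out directly. The Python sketch below is a hypothetical illustration rather than the thesis code: it assumes counts are stored in one table keyed by word tuples (with the empty tuple holding the total number of words), that the count of a history equals the sum of the counts of its extensions, and that a history with zero count simply defers to the lower-order model.

```python
from collections import Counter

def p_base(word, history, counts, vocab_size, lambdas):
    """Baseline interpolated probability p_base(word | history).

    history: tuple of preceding words, or None for the 0th-order model.
    lambdas[n]: the single weight used at the n-gram level.
    """
    if history is None:                      # 0th-order model: uniform
        return 1.0 / vocab_size
    lower_history = history[1:] if history else None
    lower = p_base(word, lower_history, counts, vocab_size, lambdas)
    total = counts[history]
    if total == 0:                           # unseen history: defer to lower order
        return lower
    lam = lambdas[len(history) + 1]
    return lam * counts[history + (word,)] / total + (1 - lam) * lower

# Toy counts for a two-word vocabulary {a, b}.
counts = Counter({(): 4, ("a",): 2, ("b",): 2, ("a", "b"): 2})
lambdas = {1: 0.5, 2: 0.5}
p_b_given_a = p_base("b", ("a",), counts, 2, lambdas)
p_a_given_a = p_base("a", ("a",), counts, 2, lambdas)
```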



2.4.2 Implementation Architecture 

In this section, we give an overview of the entire implementation. The coding was done in 
C++. The implementations of the individual smoothing techniques were all linked into 
a single program, to help ensure uniformity in the methodology used with each smoothing 
technique. 
For large training sets, it is difficult to fit an entire bigram or trigram model into a 
moderate amount of memory. Thus, we chose not to do the straightforward implementation 
of first building an entire smoothed n-gram model in memory and then evaluating it. 

Instead, notice that for a given test set, it is only necessary to build that part of the 
smoothed n-gram model that is applicable to the test set. To take advantage of this ob- 
servation, we first process the training set by taking counts of all n-grams up to the target 
n, and we sort these n-grams into an order suitable for future processing. We then iterate 
through these n-gram counts, extracting those counts that are relevant for evaluating the 
test data (or the held-out data used to optimize parameter values). We use these extracted 
counts to build the smoothed n-gram model on only the relevant data. For some smoothing 
algorithms (katz and church-gale), it is necessary to make additional passes through the 
n-gram counts to collect other statistics. 

In some experiments, we use very large test sets; in this case the above algorithm is 
not practical because too large a fraction of the total model is needed to evaluate the test data. 
For these experiments, we process the test data in the same way as the training data, by 
taking counts of all relevant n-grams and sorting them. We then iterate through both the 
training n-gram counts and test n-gram counts simultaneously, repeatedly building a small 
section of the smoothed n-gram model, evaluating it on the associated test data, and then 
discarding the partial model. 
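
The simultaneous pass over sorted training and test counts is essentially a merge join. The Python sketch below illustrates the idea under stated assumptions: the generator name `relevant_counts` and the data layout (sorted lists of n-gram tuples) are hypothetical, and building and evaluating each model section is left out.

```python
def relevant_counts(sorted_train_counts, sorted_test_ngrams):
    """Yield (ngram, training count) for each test n-gram in a single
    simultaneous pass over two streams sorted in the same order.

    sorted_train_counts: iterable of (ngram, count), sorted by ngram.
    sorted_test_ngrams: iterable of distinct ngrams, sorted.
    """
    train = iter(sorted_train_counts)
    current = next(train, None)
    for ngram in sorted_test_ngrams:
        # Skip training n-grams that sort before the one we need;
        # they are irrelevant to the test data and never held in memory.
        while current is not None and current[0] < ngram:
            current = next(train, None)
        if current is not None and current[0] == ngram:
            yield ngram, current[1]
        else:
            yield ngram, 0           # n-gram unseen in training

train_counts = [(("a", "b"), 2), (("a", "c"), 1), (("b", "a"), 4)]
test_ngrams = [("a", "a"), ("a", "c"), ("b", "a"), ("c", "a")]
extracted = list(relevant_counts(train_counts, test_ngrams))
```

Because both inputs are consumed as streams, only the counts currently being compared need to be resident, which is the point of sorting the n-grams beforehand.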

In our implementation, we include a general multidimensional search engine for auto- 
matically searching for optimal parameter values for each smoothing technique. We use the 
implementation of Powell's search algorithm (Brent, 1973) given in Numerical Recipes in C 
(Press et al., 1988, pp. 309-317). Powell's algorithm does not require the calculation of the 
gradient. It involves successive searches along vectors in the multidimensional search space. 
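
To illustrate gradient-free search by successive line minimizations, the Python sketch below repeatedly minimizes along each coordinate axis with a golden-section search. This is a simplified stand-in and not Powell's algorithm itself: Powell's method also updates its set of search directions, which this sketch does not do. The function names, the bounded search intervals, and the test objective are all assumptions.

```python
import math

def golden_section(f, lo, hi, tol=1e-6):
    """Minimize a unimodal one-dimensional function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2

def direction_search(f, x0, bounds, sweeps=20):
    """Minimize f by line searches along each coordinate axis in turn,
    using only function values (no gradients)."""
    x = list(x0)
    for _ in range(sweeps):
        for i, (lo, hi) in enumerate(bounds):
            def along_axis(t, i=i):
                return f(x[:i] + [t] + x[i + 1:])
            x[i] = golden_section(along_axis, lo, hi)
    return x

def objective(x):
    # Hypothetical smooth objective standing in for held-out cross-entropy.
    return (x[0] - 1) ** 2 + (x[1] + 2) ** 2 + 0.5 * x[0] * x[1]

best = direction_search(objective, [0.0, 0.0], [(-5, 5), (-5, 5)])
```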



2.4.3 Data 

We used the Penn treebank and TIPSTER corpora distributed by the Linguistic Data 
Consortium. From the treebank, we extracted text from the tagged Brown corpus, yielding 
about one million words. From TIPSTER, we used the Associated Press (AP), Wall Street 
Journal (WSJ), and San Jose Mercury News (SJM) data, yielding 123, 84, and 43 million 
words respectively. We created two distinct vocabularies, one for the Brown corpus and 
one for the TIPSTER data. The former vocabulary contains all 53,850 words occurring in 
Brown; the latter vocabulary consists of the 65,173 words occurring at least 70 times in 
TIPSTER. 

For each experiment, we selected three segments of held-out data along with the segment 
of training data. These four segments were chosen to be adjacent in the original corpus 
and disjoint, the held-out segments preceding the training to facilitate the use of common 
held-out data with varying training data sizes. The first held-out segment was used as 
the test data for performance evaluation, and the other two held-out segments were used 
as development test data for optimizing the parameters of each smoothing method. In 
experiments with multiple runs on the same training data size, the data segments of each 
run are completely disjoint. 
Figure 2.3: Performance relative to baseline method of katz and new-avg-count with 
respect to parameters $\delta$ and $c_{\min}$, respectively, over several training set sizes 



Each piece of held-out data was chosen to be roughly 50,000 words. This decision does 
not reflect practice well. For example, if the training set size is less than 50,000 words then 
it is not realistic to have this much development test data available. However, we made 
this choice to avoid having to optimize the training versus held-out data tradeoff for 
each data size. In addition, the development test data is used to optimize typically very 
few parameters, so in practice small held-out sets are generally adequate, and perhaps can 
be avoided altogether with techniques such as deleted estimation. 



2.4.4 Parameter Setting 



In Figure 2.3, we show how the values of the parameters $\delta$ and $c_{\min}$ affect the performance 
of methods katz and new-avg-count, respectively, over several training data sizes. Notice 
that poor parameter setting can lead to very significant losses in performance. In Figure 
2.3, we see differences in entropy from several hundredths of a bit to over a bit. Also, 
we see that the optimal value of a parameter varies with training set size. Thus, it is 
important to optimize parameter values to meaningfully compare smoothing techniques, 
and this optimization should be specific to the given training set size. 

In each experiment we ran except as noted below, optimal values for the parameters of 
the given method were searched for using Powell's search algorithm. Parameters were chosen 
to optimize the cross-entropy of the first of the two development test sets associated with 
the given training set. For katz and church-gale, we did not perform the parameter search 
for training sets over 50,000 sentences due to resource constraints, and instead manually 
extrapolated parameter values from optimal values found on smaller data sizes. 

For instances of Jelinek-Mercer smoothing, the $\lambda$'s were trained using the Baum-Welch 
algorithm on the second development test set; all other parameters were optimized using 
Powell's algorithm on the first development test set. More specifically, to evaluate the 
entropy associated with a given set of (non-$\lambda$) parameters in Powell's search, we first optimize 
the $\lambda$'s on the second test set. 
To constrain the parameter search in our main battery of experiments, we searched 
only those parameters that were found to affect performance significantly, as indicated 
through preliminary experiments over several data sizes. In each run of these preliminary 
experiments, we fixed all parameters but one to some reasonable value, and used Powell's 
algorithm to search on the single free parameter. We recorded the entropy of the test data 
for each parameter value considered by Powell's algorithm. If the range of test data entropy 
over this search was much smaller than the typical difference in entropies between different 
algorithms, we considered it safe not to perform the search over this parameter in the later 
experiments. For each parameter, we tried three different training sets: 20,000 words from 
the WSJ corpus, 1M words from the Brown corpus, and 3M words from the WSJ corpus. 

We assumed that all parameters are significant for the methods plus-one, plus-delta, 
and new-one-count. We describe the results for the other algorithms below. 

Jelinek-Mercer Smoothing (interp-held-out, interp-del-int, new-avg-count, 
interp-baseline) 

The parameter $\lambda_0$, the initial value of the $\lambda$'s in the Baum-Welch search, affected entropy by 
less than 0.001 bits. Thus, we decided not to search over this parameter in later experiments. 
We fix $\lambda_0$ to be 0.5. 

The parameter $\delta_{\text{stop}}$, controlling when to terminate the Baum-Welch search, affected 
entropy by less than 0.002 bits. We fix $\delta_{\text{stop}}$ to be 0.001 bits. 

The parameter $c_{\text{top}}$, the top count considered in bucketing, affected entropy by up to 
0.006 bits, which is significant. However, we found that in general the entropy is lower for 
higher values of $c_{\text{top}}$; this range of 0.006 bits is mainly due to setting $c_{\text{top}}$ too low. We fix 
$c_{\text{top}}$ to be a fairly large value, 100,000. 

The parameter $c_{\min}$, the minimum number of counts in each bucket, affected entropy 
by up to 0.07 bits, which is significant. Thus, we search over this parameter in later 
experiments. 



Katz Smoothing (katz) 

The parameters $k_n$, specifying the count above which counts are not discounted, affected 
entropy by up to 0.01 bits, which is significant. However, we found that the larger the $k_n$, 
the better the performance. In Figure 2.4, we display the entropy on the Brown corpus for 
different values of $k_1$, $k_2$, and $k_3$. However, for large $k$ there will be counts $r$ such that the 
associated discount ratio $d_r$ takes on an unreasonable value, such as a nonpositive value or 
a value above one. We take $k_n$ to be as large as possible such that the $d_r$ take on reasonable 
values. 
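
A minimal Python sketch of this selection rule follows. It assumes that the discount ratios of Section 2.2.4 take the standard form from Katz's original formulation, $d_r = (r^*/r - (k+1)n_{k+1}/n_1) / (1 - (k+1)n_{k+1}/n_1)$ with $r^* = (r+1)n_{r+1}/n_r$; the function names, the upper limit `k_max`, and the toy counts-of-counts are hypothetical.

```python
def katz_discounts(n, k):
    """Discount ratios d_r for 1 <= r <= k, or None if they are undefined.

    n[r] is the number of n-grams occurring exactly r times.
    """
    if n.get(1, 0) == 0:
        return None
    tail = (k + 1) * n.get(k + 1, 0) / n[1]
    if tail == 1:
        return None                          # denominator would be zero
    ratios = {}
    for r in range(1, k + 1):
        if n.get(r, 0) == 0:
            return None
        r_star = (r + 1) * n.get(r + 1, 0) / n[r]   # Good-Turing corrected count
        ratios[r] = (r_star / r - tail) / (1 - tail)
    return ratios

def largest_reasonable_k(n, k_max):
    """Largest k <= k_max for which every d_r lies in (0, 1]; 0 if none."""
    best = 0
    for k in range(1, k_max + 1):
        ratios = katz_discounts(n, k)
        if ratios is not None and all(0 < d <= 1 for d in ratios.values()):
            best = k
    return best

counts_of_counts = {1: 100, 2: 40, 3: 20, 4: 12, 5: 8, 6: 10}
chosen_k = largest_reasonable_k(counts_of_counts, k_max=6)
```

In this toy example the irregular value of `n[6]` pushes $d_5$ above one, so any cutoff of five or more is rejected.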

The parameter $\beta$, describing how many counts are given to n-grams with zero counts if 
no counts in a distribution are discounted, affected the entropy by less than 0.001 bits. We 
fix $\beta$ to be 1. 

The parameter $\delta$, the constant used for the additive smoothing of the unigram 
distribution, affected entropy by up to 0.02 bits, which is significant. Thus, we search over this 
parameter in later experiments. For large training sets (over 50,000 sentences), we do not 
Figure 2.4: Effect of $k_n$ on Katz smoothing 




Figure 2.5: Effect of $c_{\min}$ and $c_{mb}$ on Church-Gale smoothing 



perform the search due to time constraints. Instead, we choose its value by manually extrap- 
olating from the optimal values found on smaller training sets. For example, for TIPSTER 
we found that 5 = 0.0011 x l s J fits the optimal values found for smaller training sets well, 
where $l_s$ denotes the number of sentences in the training data. 

Church-Gale Smoothing (church-gale) 

The parameter $p_{n_1=0}$, the probability assigned to zero counts if there are no one-counts in 
a distribution, did not affect the entropy at all. We fix $p_{n_1=0}$ to be 0.01. 

The parameter $p_{n_1=N}$, the probability assigned to zero counts if all counts in a 
distribution are one-counts, affected the entropy by up to 0.2 bits. Thus, we search over this 
parameter in later experiments. However, for training sets over 50,000 sentences, due to 
time constraints we do not perform parameter search for church-gale. We noticed that 
for larger training sets this parameter does not seem to have a large effect (0.002 bits on 
the 3M words of WSJ), and the optimal value tends to be very close to 1. Thus, for large 
training sets we fix $p_{n_1=N}$ to be 0.995. 
The parameters $c_{\min}$, the minimum number of counts per bucket, and $c_{mb}$, the number 
of minibuckets, both affected the entropy a great deal (over 0.5 bits). Thus, we search 
over these parameters in later experiments. However, the search space for both of these 
parameters is very bumpy, so it is unclear how effective the search process is. In Figure 
2.5, we display the entropy on test data for various values of $c_{\min}$ and $c_{mb}$ when training 
on 3M words of WSJ. The search algorithm will find a local minimum, but we will have no 
guarantee on the global quality of this minimum given the nature of the search space. 

As mentioned above, for training sets over 50,000 sentences, due to time constraints we 
do not perform parameter search for church-gale. Fortunately, for larger training sets $c_{\min}$ 
and $c_{mb}$ seem to have a smaller effect (0.07 and 0.03 bits, respectively, on the 3M words 
of WSJ). For training sets over 50,000 sentences, we just guess reasonable values for these 
parameters: we fix $c_{\min}$ to be 500 and $c_{mb}$ to be 100,000. For very large data sets, due to 
memory constraints we take $c_{\min}$ to be to limit the number of buckets created. 



2.5 Results 

In this section, we present the results of our experiments. First, we present the performance 
of various algorithms for different training set sizes on different corpora for both bigram and 
trigram models. We demonstrate that the relative performance of smoothing methods varies 
significantly over training sizes and n-gram order, and we show which methods perform best 
in different situations. We find that katz performs best for bigram models produced from 
moderately-sized data sets, church-gale performs best for bigram models produced from 
large data sets, and our novel methods new-avg-count and new-one-count perform best 
for trigram models. 

Then, we present a more detailed analysis of performance, rating different techniques 
on how well they perform on n-grams with a particular count in the training data, e.g., 
n-grams that have occurred exactly once in the training data. We find that katz and 
church-gale most accurately smooth n-grams with large counts, while new-avg-count and 
new-one-count are best for small counts. We then show the relative impact on performance 
of small counts and large counts for different training set sizes and n-gram orders, and 
use this data to explain the variation in performance of different algorithms in different 
situations. 

Finally, we examine three miscellaneous points: the accuracy of the Good-Turing 
estimate in smoothing n-grams with zero counts, how Church-Gale smoothing compares to 
linear interpolation, and how deleted interpolation compares with held-out interpolation. 



2.5.1 Overall Results 



In Figure 2.6, we display the performance of the interp-baseline method for bigram and 
trigram models on TIPSTER, Brown, and the WSJ subset of TIPSTER. In Figures 2.7-2.10, 
we display the relative performance of various smoothing techniques with respect to 
the baseline method on these corpora, as measured by difference in entropy. In the graphs 
on the left of Figures 2.6-2.8, each point represents an average over ten runs; the error bars 
Figure 2.6: Baseline cross-entropy on test data; graph on left displays averages over ten 
runs for training sets up to 50,000 sentences, graph on right displays single runs for training 
sets up to 10,000,000 sentences 




Figure 2.7: Trigram model on TIPSTER data; relative performance of various methods 
with respect to baseline; graphs on left display averages over ten runs for training sets up 
to 50,000 sentences, graphs on right display single runs for training sets up to 10,000,000 
sentences; top graphs show all algorithms, bottom graphs zoom in on those methods that 
perform better than the baseline method 
Figure 2.8: Bigram model on TIPSTER data; relative performance of various methods 
with respect to baseline; graphs on left display averages over ten runs for training sets up 
to 50,000 sentences, graphs on right display single runs for training sets up to 10,000,000 
sentences; top graphs show all algorithms, bottom graphs zoom in on those methods that 
perform better than the baseline method 




Figure 2.9: Bigram and trigram models on Brown corpus; relative performance of various 
methods with respect to baseline 
Figure 2.10: Bigram and trigram models on Wall Street Journal corpus; relative performance 
of various methods with respect to baseline 



represent the empirical standard deviation over these runs. Due to resource limitations, 
we only performed multiple runs for data sets of 50,000 sentences or less. Each point on 
the graphs on the right represents a single run, but we consider training set sizes up to 
the amount of data available, e.g., up to 250M words on TIPSTER. The graphs on the 
bottom of Figures 2.7-2.8 are close-ups of the graphs above, focusing on those algorithms 
that perform better than the baseline. We ran interp-del-int only on training sets up 
to 50,000 sentences due to time constraints. To give an idea of how these cross-entropy 
differences translate to perplexity, each 0.014 bits corresponds roughly to a 1% change in 
perplexity. 
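
The conversion follows from perplexity being two raised to the cross-entropy in bits, so an entropy difference of $\Delta$ bits multiplies perplexity by $2^{\Delta}$. A brief Python check (the function name `perplexity_ratio` is hypothetical):

```python
def perplexity_ratio(delta_bits):
    """Factor by which perplexity changes when cross-entropy changes by
    delta_bits, since perplexity = 2 ** (cross-entropy in bits)."""
    return 2.0 ** delta_bits

relative_change = perplexity_ratio(0.014) - 1.0   # just under 0.01, i.e. about 1%
one_bit = perplexity_ratio(1.0)                   # a full bit doubles perplexity
```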

From these graphs, we see that additive smoothing performs poorly and that the meth- 
ods katz and interp-held-out consistently perform well, with katz performing the best 
of all algorithms on small bigram training sets. The implementation church-gale performs 
poorly except on large bigram training sets, where it performs the best. The novel meth- 
ods new-avg-count and new-one-count perform well uniformly across training data sizes, 
and are superior for trigram models. Notice that while performance is relatively consistent 
across corpora, it varies widely with respect to training set size and n-gram order. 



2.5.2 Count-by-Count Analysis 

To paint a more detailed picture of performance, we consider the performance of different 
models on only those n-grams in the test data that have exactly r counts in the training 
data, for small values of r. This analysis provides information as to whether a model assigns 
the correct amount of probability to categories such as n-grams with zero counts, n-grams 
with low counts, or n-grams with high counts. In these experiments, we use about 10 million 
words of test data. 

First, we consider whether various smoothing methods assign on average the correct 
discounted count r* for a given count r. The discounted count r* generally varies for 
different n-grams; we only consider its average value here. (Recall that the corrected count 
Figure 2.11: Average corrected counts for bigram and trigram models, 1M words training 
data 




Figure 2.12: Average corrected counts for bigram and trigram models, 200M words training 
data 
of an n-gram is proportional to the probability assigned by a model to that n-gram; in 
particular, in this discussion we assume that the probability assigned to an n-gram $w_{i-n+1}^{i}$ 
with $r$ counts is just its normalized corrected count $r^*/N$, where the normalization constant 
$N$ is equal to the original total count $\sum_{w_i} c(w_{i-n+1}^{i})$.) To calculate how closely a model 
comes to assigning the correct average $r^*$, we compare the expected value of the number of 
times n-grams with $r$ counts occur in the test data with the actual number of times these 
n-grams occur. When the expected and actual counts agree, this corresponds to assigning 
the correct average value of $r^*$. 

We can estimate the actual correct average $r_0^*$ for a given count $r$ by using the following 
formula: 

$$r_0^* = r \times \frac{\text{actual number of n-grams with } r \text{ counts in the test data}}{\text{expected number according to the maximum likelihood model}}$$

The maximum likelihood model represents the case where we take the corrected count to
just be the original count. In Figure 2.11, we display the desired average corrected count
for each count less than 40, for 1M words of training data from TIPSTER.
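
As a rough illustration of this estimate, the following Python sketch computes the desired
average corrected count for small nonzero r. It is not the code used in these experiments.
It assumes a single global n-gram distribution rather than conditional distributions, it
reads the estimate as the original count r scaled by the ratio of actual to expected
occurrences, and it skips r = 0. The function name and the toy data are hypothetical.

```python
from collections import Counter

def ideal_average_corrected_counts(train_ngrams, test_ngrams, max_r=5):
    """Estimate r * actual / expected for r = 1..max_r (simplified, global setting)."""
    train_counts = Counter(train_ngrams)
    n_train = len(train_ngrams)
    n_test = len(test_ngrams)
    # n_r[r]: number of distinct n-grams seen exactly r times in training
    n_r = Counter(train_counts.values())
    # actual[r]: number of test tokens whose n-gram has training count r
    actual = Counter(train_counts.get(g, 0) for g in test_ngrams)
    result = {}
    for r in range(1, max_r + 1):
        if n_r[r] == 0:
            continue  # no n-grams with this count; nothing to estimate
        # Maximum likelihood gives each such n-gram probability r / n_train,
        # so this is the expected number of test tokens with training count r.
        expected = n_r[r] * (r / n_train) * n_test
        result[r] = r * actual[r] / expected
    return result

# Toy unigram "corpora" (hypothetical data)
train = ["a", "a", "b", "c"]
test = ["a", "b", "d", "a"]
print(ideal_average_corrected_counts(train, test))
```

A value below r means that n-grams with r counts occurred less often in the test data than
maximum likelihood predicts, so they should be discounted.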

The last point in the graph corresponds to the average discount for that count and all
higher counts. (This property also holds for later graphs: the last point corresponds to that
count and all higher counts.) The solid line corresponds to the maximum likelihood model
where the corrected count is taken to be equal to the original count. In Figure 2.12, we
display the same graph except for 200M words of training data from TIPSTER.



In Figures 2.13 and 2.14, we display how close various smoothing methods came to
the desired average corrected count, again using 1M and 200M words of training data
from TIPSTER. For each model, we graph the ratio of the actual average corrected count
assigned by the model to the ideal average corrected count. For the zero count case, we
exclude those n-grams $w_{i-n+1}^{i}$ that occur in distributions that have a total of zero
counts, i.e., $\sum_{w_i} c(w_{i-n+1}^{i}) = 0$. For these n-grams, the corrected count should
be zero since the total count is zero. (These n-grams are also excluded in later graphs.)

We see that all of the algorithms tested tend to assign slightly too little probability 
to n-grams with zero counts, and significantly too much probability to n-grams with one 
count. For high counts, the algorithms tend to assign counts closer to the correct average 
count on the larger training set than on the smaller training set. This effect did not hold 
for low counts. 

The algorithms katz and church-gale consistently come closest to assigning the correct
average amount of probability to larger counts, with katz doing especially well. Thus, we
conclude that the Good-Turing estimate is a useful tool for accurately estimating the desired
average corrected count, as both Katz and Church-Gale smoothing use this estimate.¹⁴

In contrast, we see that methods involving linear interpolation are not as accurate on
large counts, overdiscounting large counts in three of the experiments, and underdiscounting
large counts in the 1M word bigram experiment. Roughly speaking, linear interpolation
corresponds to linear discounting; that is, a corrected count r* is about λ times the original



14 This only applies to Katz if a large k is used, as counts above k are not discounted.



[Plot legend: interp-baseline, church-gale, interp-held-out, katz, new-avg-count,
new-one-count]



Figure 2.13: Expected over actual counts for various algorithms, bigram and trigram models,
1M words training data

Figure 2.14: Expected over actual counts for various algorithms, bigram and trigram models,
200M words training data

Figure 2.15: Relative performance at each count for various algorithms, bigram and trigram
models, 1M words training data

Figure 2.16: Relative performance at each count for various algorithms, bigram and trigram
models, 200M words training data



count r in Jelinek-Mercer smoothing. Referring to Figures 2.11 and 2.12, it is clear that the
desired average corrected count is not a constant multiplied by the original count; a more
accurate description is fixed discounting, in which the corrected count is the original count
less a constant.

The above analysis only considers whether an algorithm yields the desired average
corrected count; it does not provide insight into whether an algorithm varies the corrected
count in different distributions in a felicitous manner. For example, the Good-Turing
estimate predicts that one should assign a total probability of $n_1/N$ to n-grams with zero
counts; obviously, this value varies from distribution to distribution. In Figures 2.15 and
2.16, we display a measure that we call bang-for-the-buck that reflects how well a smoothing
algorithm varies the corrected count r* of a given count r in different distributions. To
explain this measure, we first consider the measure of just taking the total entropy assigned
in the test data to n-grams with a given count r. Presumably, the smaller the entropy assigned
to these n-grams, the better a smoothing algorithm is at estimating these corrected counts.
However, an algorithm that assigns a higher average corrected count will tend to have a
lower entropy. We want to factor out this effect, and we do this by normalizing the average
corrected count of each algorithm to the same value before calculating the entropy. We
call this measure bang-for-the-buck as it reflects the relative performance of each algorithm
given that they all assign the same amount of probability to a given count. In Figures 2.15
and 2.16, we display the bang-for-the-buck per word of various algorithms relative to the
baseline method; as this score is an entropy value, the lower the score, the better.

Figure 2.17: Fraction of entropy devoted to various counts over many training sizes, baseline
smoothing, bigram and trigram models

For larger counts, Katz and Church-Gale yield superior bang-for-the-buck. We hypoth- 
esize that this is because the linear discounting used by other methods is a poor way to 
discount large counts. On small nonzero counts, Katz smoothing does relatively poorly. 
We hypothesize that this is because Katz smoothing does not perform any interpolation 
with lower-order models for these counts, while other methods do. It seems likely that 
lower-order models still provide useful information if counts are low but nonzero. The best 
methods for modeling zero counts are our two novel methods. 

The method church-gale performs especially poorly on zero counts in trigram models.
This can be attributed to the implementation choice discussed in Section 15. We chose to
implement a version of the algorithm analogous to interpolating the trigram model directly
with a unigram model, as opposed to a version analogous to interpolating the trigram model
with a bigram model. (As discussed, it is unclear whether the latter version is practical.)

Given the above analysis, it is relevant to note what fraction of the total entropy of
the test data is associated with n-grams of different counts. In Figure 2.17, we display
this information for different training set sizes for bigram and trigram models. A line
labelled r < k graphs the fraction of the entropy devoted to n-grams with up to k counts.
For instance, the region below the lowest line is the fraction of the entropy devoted to zero
counts (excluding those zero counts that occur in distributions with a total of zero counts; as
mentioned before we treat these n-grams separately). The fraction of the entropy devoted
to zero count n-grams occurring in zero count distributions is represented by the region
above the top line in the graph.

Figure 2.18: Average count assigned to n-grams with zero count for various $n_1$ and N,
actual, bigram and trigram models

Figure 2.19: Average count assigned to n-grams with zero count for various $n_1$ and N,
predicted by Good-Turing

This data explains some of the variation in the relative performance of different algo- 
rithms over different training set sizes and between bigram and trigram models. Our novel 
methods get most of their performance gain relative to other methods from their perfor- 
mance on zero counts. Because zero counts are more frequent in trigram models, our novel 
methods perform especially well on these models. Furthermore, because zero counts are less 
frequent in large training sets, our methods do not do as well from a relative perspective on 
larger data. On the other hand, Katz smoothing and Church-Gale smoothing do especially 
well on large counts. Thus, they yield better performance on bigram models and on large 
training sets.

2.5.3 Accuracy of the Good-Turing Estimate for Zero Counts

Because the Good-Turing estimate is a fundamental tool in smoothing, it is interesting to
test its accuracy empirically. In this section, we describe experiments investigating how well
the Good-Turing estimate assigns probabilities to n-grams with zero counts in conditional
bigram and trigram distributions. We consider zero counts in particular because zero-count
n-grams contribute a very sizable fraction of the total entropy, as shown in Figure 2.17.



The Good-Turing estimate predicts that the total probability assigned to n-grams with
zero counts should be $n_1/N$, the number of one-counts in a distribution divided by the total
number of counts in the distribution. In terms of corrected counts, this corresponds to
assigning a total of $n_1$ counts to n-grams with zero counts.¹⁵ We can calculate the desired
average corrected count for a given $n_1$ and N by using a similar analysis as in Section 2.5.2,
comparing the expected number and actual number of zero-count n-grams in test data. In
Figure 2.18, we display the desired total number of corrected counts assigned to zero counts
for various values of $n_1$ and N, for 1M words of TIPSTER training data.
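
To make the quantities concrete, the following Python sketch computes the number of
one-counts, the total count N, and the Good-Turing prediction for the total probability of
zero-count n-grams in one conditional distribution. The function name and the example counts
are hypothetical and serve only as an illustration.

```python
from collections import Counter

def good_turing_zero_mass(history_counts):
    """Return (n1, N, n1 / N) for one conditional distribution.

    history_counts maps each word observed after a fixed history to its count.
    """
    total = sum(history_counts.values())                      # N
    n1 = sum(1 for c in history_counts.values() if c == 1)    # number of one-counts
    if total == 0:
        raise ValueError("distribution has a total of zero counts")
    return n1, total, n1 / total

# Hypothetical counts of words following some fixed history
counts = Counter({"cat": 3, "dog": 1, "tree": 1, "boat": 5})
n1, total, zero_mass = good_turing_zero_mass(counts)
print(n1, total, zero_mass)  # 2 10 0.2
```

In terms of corrected counts, the prediction here is that a total of 2 counts should be
spread over all words never seen after this history.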

Each line in the graph corresponds to a different value of $n_1$. The x-axis corresponds to
N, and the y-axis corresponds to the desired count to assign to n-grams with zero counts.
If the Good-Turing estimate were exactly accurate, then we would have a horizontal line
for each $n_1$ at the level $y = n_1$, as displayed in Figure 2.19. However, we see that the lines
are not very horizontal for smaller N, and that asymptotically they seem to level out at a
value significantly larger than $n_1$.

We hypothesize that this is because the assumption made by Good-Turing that successive
n-grams are independent is incorrect. The derivation of the Good-Turing estimate
relies on the observation that for an event with probability p, the probability that it will
occur r times in N trials is $\binom{N}{r} p^r (1-p)^{N-r}$. However, this only holds if each
trial is independent. Clearly, language has decidedly clumpy behavior. For example, a given
word has a higher chance of occurring given that it has occurred recently.¹⁶ Thus, the actual
number of one-counts is probably lower than what would be expected if independence were to
hold, so the probability given to zero counts should be larger than $n_1/N$.

2.5.4 Church-Gale Smoothing versus Linear Interpolation 

Church-Gale smoothing incorporates the information from lower-order models into higher- 
order models through its bucketing mechanism, unlike other smoothing methods that use 
linear interpolation. In this section, we present empirical results on how these two different 
techniques compare. 

First, we compare how Church-Gale smoothing and linear interpolation assign corrected
counts to zero counts. In Figure 2.20, we present the corrected count of zero counts for
each bucket in a Church-Gale run. In linear interpolation, the corrected count of an
n-gram with zero counts is proportional to the probability assigned to that n-gram in the


15 As in Section 2.3.1, we use $n_1$ and N to refer to counts in a conditional distribution
$p(w_i | w_{i-n+1}^{i-1})$ for a fixed $w_{i-n+1}^{i-1}$, as opposed to counts in the global
n-gram distribution.

16 This observation is taken advantage of in dynamic language modeling (Kuhn, 1988;
Huang, 199^; Rosenfeld, 1994ej).

Figure 2.20: Corrected count assigned to zero counts by Church-Gale for all buckets, bigram
and trigram models

Figure 2.21: Corrected count assigned to various counts by Church-Gale for all buckets,
bigram and trigram models

Figure 2.22: Held-out versus deleted interpolation on TIPSTER data, relative performance
with respect to baseline, bigram and trigram models



next lower-order model. The x-axis of the graph represents the value used for bucketing in 
Church-Gale, which is proportional to the probability assigned to an n-gram by the next 
lower-order model. Thus, if Church-Gale smoothing assigns corrected counts to n-grams 
with zero counts similarly to linear interpolation, the graph will be a line with slope 1 (given 
that both axes are logarithmic). The actual graph is not far from this situation; the solid 
lines in the graphs are lines with slope 1. Thus, even though Church-Gale smoothing is 
very far removed from linear interpolation on the surface, for zero counts their behaviors 
are rather similar. 



In Figure 2.21, we display the corrected counts for Church-Gale smoothing for n-grams
with larger counts. For these counts, the curves are very different from what would be 
yielded with linear interpolation. (The shapes of the curves consistent with linear inter- 
polation are different from those found in the previous figure because the y-axis is linear 
instead of logarithmic scale.) 



2.5.5 Held-out versus Deleted Interpolation 

In this section, we compare the held-out and deleted interpolation variations of
Jelinek-Mercer smoothing. Referring to Figure 2.22, we notice that the method interp-del-int
performs significantly worse than interp-held-out on TIPSTER data, though they differ
only in that the former method uses deleted interpolation while the latter method uses
held-out interpolation. Similar results hold for the other corpora, as shown in the earlier
figures.

However, the implementation interp-del-int does not completely characterize the 
technique of deleted interpolation as we do not vary the size of the chunks that are deleted. 
In particular, we made the choice of deleting only a single word at a time for implementation 
ease; we hypothesize that deleting larger chunks would lead to more similar performance to 
interp-held-out. 

As mentioned earlier, language tends to have clumpy behavior. Held-out data external
to the training data will tend to be more different from the training data than data that is
deleted from the middle of the training data. As our evaluation test data is also external
to the training data (as is the case in applications), λ's trained from held-out data should
better characterize the evaluation test data. However, the larger the chunks deleted in
deleted interpolation, the more the deleted data behaves like held-out data. For example,
if we delete half of the data at a time, this is very similar to the held-out data situation.
Thus, larger chunks should yield better performance than that achieved by deleting one
word at a time.

However, for large training sets the computational expense of deleted interpolation
becomes a factor. In particular, the computation required is linear in the training data size.
For held-out interpolation, the computation is linear in the size of the held-out data.
Because there are relatively few λ's, these parameters can be trained reliably using a fairly
small amount of data. Furthermore, for large training sets it matters little that held-out
interpolation requires some data to be reserved for training λ's while in deleted
interpolation no data needs to be reserved for this purpose. Thus, for large training sets,
held-out interpolation seems the sensible choice.
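
The cost argument can be seen in a minimal Python sketch of training an interpolation weight
on held-out data with the expectation-maximization algorithm. This is an illustration under
simplifying assumptions, not the implementation evaluated here: it trains a single λ instead
of bucketed λ's, and it takes as input, for each held-out word, the probability under the
higher-order maximum likelihood model and under the lower-order model (assumed nonzero). All
names and numbers are hypothetical.

```python
import math

def train_lambda(heldout_events, iterations=50, lam=0.5):
    """EM for one weight in p = lam * p_high + (1 - lam) * p_low.

    heldout_events is a list of (p_high, p_low) pairs, one per held-out word.
    Each iteration makes one pass over the held-out data only.
    """
    for _ in range(iterations):
        posterior_sum = 0.0
        for p_high, p_low in heldout_events:
            mix = lam * p_high + (1.0 - lam) * p_low
            # Posterior probability that the word came from the higher-order model
            posterior_sum += lam * p_high / mix
        lam = posterior_sum / len(heldout_events)
    return lam

def log_likelihood(heldout_events, lam):
    """Natural-log likelihood of the held-out data under the interpolated model."""
    return sum(math.log(lam * ph + (1.0 - lam) * pl) for ph, pl in heldout_events)

# Hypothetical held-out probabilities (p_high, p_low)
events = [(0.5, 0.1), (0.0, 0.2), (0.3, 0.3), (0.8, 0.05)]
lam = train_lambda(events)
print(lam, log_likelihood(events, lam))
```

Deleted interpolation would run the same update, but over events obtained by deleting each
chunk of the training data in turn, which is why its cost grows with the training set size.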



2.6 Discussion 



Smoothing is a fundamental technique for statistical modeling, important not only for
language modeling but for many other applications as well, e.g., prepositional phrase
attachment (Collins and Brooks, 1995), part-of-speech tagging (Church, 1988), and stochastic
parsing (Magerman, 1994). Whenever data sparsity is an issue (and it always is), smoothing
has the potential to improve performance with moderate effort. Thus, thorough studies
of smoothing can benefit the research community a great deal.

To our knowledge, this is the first empirical comparison of smoothing techniques in
language modeling of such scope: no other study has systematically examined multiple
training data sizes or corpora, or has performed parameter optimization. We show that in
order to completely characterize the relative performance of two techniques, it is necessary
to consider multiple training set sizes and to try both bigram and trigram models. We show
that sub-optimal parameter selection can also significantly affect relative performance.

Multiple runs should be performed whenever possible to discover whether any calculated
differences are statistically significant; it is unclear whether previously reported results in
the literature are reliable given that they are based on single runs and given the variances
found in this work. For example, we found that the standard deviation of the average
performance of Katz smoothing relative to the baseline method is about 0.005 bits for ten
runs. Extrapolating to a single run, we expect a standard deviation of about
$\sqrt{10} \times 0.005 \approx 0.016$ bits, which translates to about a 1% difference in
perplexity. In the Nadas and Katz papers, differences in perplexity between algorithms of
about 1% are reported for a single test set of 100 sentences. MacKay and Peto present
perplexity differences between algorithms of significantly less than 1%.
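
The arithmetic behind this extrapolation can be checked with a few lines of Python. The
sketch assumes that the runs are independent, so that the standard deviation of a single run
is the square root of the number of runs times that of the average, and it uses the fact that
perplexity is two raised to the entropy in bits. The function names are hypothetical.

```python
import math

def single_run_std(std_of_mean, runs):
    """Standard deviation of one run, given that of the mean of independent runs."""
    return math.sqrt(runs) * std_of_mean

def perplexity_ratio(entropy_difference_bits):
    """Multiplicative change in perplexity for a given entropy difference in bits."""
    return 2.0 ** entropy_difference_bits

std_bits = single_run_std(0.005, 10)
print(round(std_bits, 3))                                 # 0.016
print(round(100 * (perplexity_ratio(std_bits) - 1), 1))   # 1.1 (percent)
```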

Of the techniques studied, we have found that Katz smoothing performs best for bigram
models produced from small training sets, while Church-Gale performs best for bigram
models produced from large training sets. This is a new result; Church-Gale smoothing has
never previously been empirically compared with any of the popular smoothing techniques
for language modeling. Our novel methods average-count and one-count are superior for
trigram models and perform well in bigram models; method one-count yields marginally
worse performance but is extremely easy to implement.

Furthermore, we provide a count-by-count analysis of the performance of different 
smoothing techniques. By analyzing how frequently different counts occur in a given do- 
main, we can make rough predictions on the relative performance of different algorithms 
in that domain. For example, this analysis lends insight into how different algorithms will 
perform on training sizes and n-gram orders other than those we tested. 

However, it is extremely important to note that in this work performance is measured
solely through the cross-entropy of test data. This choice was made because it is fairly
inexpensive to evaluate cross-entropy, which enabled us to run experiments of such scale. Yet
it is unclear how entropy differences translate to differences in performance in real-world
applications such as speech recognition. While entropy generally correlates with performance
in applications, small differences in entropy have an unpredictable effect; sometimes
a reduction in entropy can lead to an increase in application error-rate, e.g., as reported
by Iyer et al. (1994). In other words, entropy by no means completely characterizes
application performance. Furthermore, it is quite possible that relative smoothing performance
results found in one application will not translate to other applications. Thus, to
accurately estimate the effect of smoothing in a given application, it is probably necessary to
run experiments using that particular application.

However, we can guess how smoothing might affect application performance by
extrapolating from existing results. For example, Isotani and Matsunaga (1994) present the
error rate of a speech recognition system using three different language models. As they also
report the entropies of these models, we can linearly extrapolate to estimate how much the
differences in entropy typically found between smoothing methods affect speech recognition
performance. In Table 2.4, we list typical entropy differences found between smoothing
methods, where the "best" methods refer to interp-held-out, katz, new-avg-count, and
new-one-count. We also display how these entropy differences affect application performance
as extrapolated from the Isotani data. The row labelled original lists the error rate
of the model tested by Isotani and Matsunaga with the highest entropy; the lower rows list
the extrapolated error rate if the model entropy were decreased by the prescribed amount.
In Table 2.4, we also display a similar analysis using data given by Rosenfeld (1994b). This
analysis suggests that smoothing does not matter much as long as one uses a "good"
implementation of one of the better algorithms, e.g., those algorithms that perform
significantly better than the baseline; it is more likely that the differences between the best
and worst algorithms are significant.
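
As a small consistency check on Table 2.4, the following Python sketch recomputes the
"decrease in error rate" column from the listed error rates, under the assumption that the
column reports the decrease relative to the original error rate. The function name is
hypothetical; the error rates are those listed in the table.

```python
def relative_decrease(original, reduced):
    """Percentage decrease of an error rate relative to the original error rate."""
    return 100.0 * (original - reduced) / original

# Error rates from Table 2.4: original, then -0.05, -0.10, and -0.15 bits
isotani = [48.7, 48.2, 47.6, 46.7]      # sentence error rates
rosenfeld = [19.9, 19.7, 19.5, 19.3]    # word error rates

for name, rates in (("Isotani and Matsunaga", isotani), ("Rosenfeld", rosenfeld)):
    decreases = [round(relative_decrease(rates[0], r), 1) for r in rates[1:]]
    print(name, decreases)
```

Under this reading, the recomputed values agree with the table: 1.0, 2.3, and 4.1 percent
for the Isotani and Matsunaga data, and 1.0, 2.0, and 3.0 percent for the Rosenfeld data.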

We have found that it is surprisingly difficult to design a "good" implementation of
an existing algorithm. Given the description of our implementations, it is clear that there
are usually many choices that need to be made in implementing a given algorithm; most
smoothing techniques are incompletely specified in the literature. For example, as pointed
out in Section 2.4.1, in certain cases Katz smoothing as originally described can assign
probabilities of zero, which is undesirable as this leads to an infinite entropy. We needed to
perform a fair amount of tuning for each algorithm before we judged our implementation
to be a reasonable representative of the algorithm. Poor choices often led to very significant
differences in performance.

Isotani and Matsunaga

entropy        sentence error rate    decrease in error rate
original       48.7
-0.05 bits     48.2                   1.0%
-0.10 bits     47.6                   2.3%
-0.15 bits     46.7                   4.1%

Rosenfeld

entropy        word error rate        decrease in error rate
original       19.9
-0.05 bits     19.7                   1.0%
-0.10 bits     19.5                   2.0%
-0.15 bits     19.3                   3.0%

0.05 bits ≈ typical entropy difference between best methods
0.10 bits ≈ maximum entropy difference between best methods
0.15 bits ≈ typical entropy difference between best methods and baseline method

Table 2.4: Effect on speech recognition performance of typical entropy differences found
between smoothing methods

Finally, we point out that because of the variation in the performance of different
smoothing methods and the variation in the performance of different implementations of
the same smoothing method (e.g., from parameter setting), it is vital to specify the exact
smoothing technique and implementation of that technique used when referencing the
performance of an n-gram model. For example, the Katz and Nadas papers describe
comparisons of their algorithms with "Jelinek-Mercer" smoothing, but they do not specify the
bucketing scheme used or the granularity used in deleted interpolation. Without this
information, it is impossible to determine whether their comparisons are meaningful. More
generally, there has been much work comparing the performance of various models with
that of n-gram models where the type of smoothing used is not specified, e.g., work by
McCandless and Glass (1993) and Carroll (1995). Again, without this information we cannot
tell if the comparisons are significant.



2.6.1 Future Work 

Perhaps the most important work that needs to be done is to see how different smoothing 
techniques perform in actual applications. This would reveal how entropy differences relate
to performance in different applications, and would indicate whether it is worthwhile to
continue work in smoothing, given the largest entropy differences we are likely to achieve. 
Also, as mentioned before smoothing is used in other language tasks such as prepositional 
phrase attachment, part-of-speech tagging, and stochastic parsing. It would be interesting 
to see whether our results extend to domains other than language modeling. 

Some smoothing algorithms that we did not consider that would be interesting to compare
against are those from the field of data compression, which includes the subfield of
text compression (Bell et al., 1990). However, smoothing algorithms for data compression
have different requirements from those used for language modeling. In data compression, it 
is essential that smoothed models can be built extremely quickly and using a minimum of 
memory. In language modeling, these requirements are not nearly as strict. 

As far as designing additional smoothing methods that surpass existing techniques, there 
were many avenues that we did not pursue. Hybrid smoothing methods look especially 
promising. As we found different methods to be superior for bigram and trigram models, 
it may be advantageous to use different smoothing methods in the different n-gram models 
that are interpolated together. Furthermore, in our count-by-count analysis we found that 
different algorithms were superior on low versus high counts. Using different algorithms for 
low and high counts may be another way to improve performance.

Chapter 3

Bayesian Grammar Induction for
Language Modeling

In this chapter, we describe a corpus-based induction algorithm for probabilistic context-free
grammars (Chen, 1995) that significantly outperforms the grammar induction algorithm
introduced by Lari and Young (1990), the most widely-used algorithm for probabilistic
grammar induction. In addition, it outperforms n-gram models on data generated with
medium-sized probabilistic context-free grammars, though not on naturally-occurring data.
Of the three structural levels at which we model language in this thesis, this represents
work at the constituent level.

3.1 Introduction 

While n-gram models currently yield the best performance in language modeling, they 
seem to have obvious deficiencies. For instance, n-gram language models can only capture 
dependencies within an n-word window, where currently the largest practical n for natural 
language is three, and many dependencies in natural language occur beyond a three-word 
window. In addition, n-gram models are extremely large, thus making them difficult to 
implement efficiently in memory-constrained applications. 

An appealing alternative is grammar-based language models. Grammar has long been
the representation of language used in linguistics and natural language processing, and
intuitively such models capture properties of language that n-gram models cannot. For
example, it has been shown that grammatical language models can express long-distance
dependencies (Lari and Young, 1990; Resnik, 1992; Schabes, 1992). Furthermore, grammatical
models have the potential to be more compact while achieving equivalent performance
as n-gram models (Brown et al., 1992b). To demonstrate these points, we introduce the
grammar formalism we use, probabilistic context-free grammars (PCFG).

3.1.1 Probabilistic Context-Free Grammars 

We first give a brief introduction to (non-probabilistic) context-free grammars (Chomsky,
1964). As mentioned in the introduction, grammars consist of rules that describe how
structures at one level of language combine to form structures at the next higher level. For
example, consider the following grammar:¹

S  -> NP VP
VP -> V NP
NP -> D N
D  -> a | the
N  -> boat | cat | tree
V  -> hit | missed

The third through fifth rules state that a noun phrase can be composed of a determiner
followed by a noun, a determiner may be formed by the words a or the, and a noun may be 
formed by the words boat, cat, or tree. Thus, we have that strings such as a cat or the boat 
are noun phrases. Applying the other rules in the grammar, we see that strings such as a 
boat missed the tree or the cat hit the boat are sentences. Grammars provide a compact and 
elegant way for representing a set of strings. 

The above grammar is considered context-free because there is a single symbol on the 
left-hand side of each rule; there are grammar formalisms that allow multiple symbols. 
The symbols at the lowest level of the grammar such as a, hit, and tree are called the 
terminal symbols of the grammar. In all of the grammars we consider, the terminal symbols 
correspond to words. The other symbols in the grammar such as S, NP, and D are called 
nonterminal symbols. 

For every grammar, a particular nonterminal symbol is chosen to be the sentential 
symbol. The sentential symbol determines the set of strings the grammar is intended to 
describe; a grammar is said to accept a string if the string forms an instance of the sentential 
symbol. For example, if in the above example we take the sentential symbol to be S, then 
the grammar accepts the string a cat hit the tree but not the string the tree. The sentential 
symbol is usually taken to be the symbol corresponding to the highest level of structure in 
the grammar, and in language domains it usually corresponds to the linguistic concept of a 
sentence. In this work, we always name the sentential symbol S and it is always meant to 
correspond to a sentence (as opposed to a lower- or higher-level linguistic structure). 
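
To illustrate acceptance, the following Python sketch enumerates every string derivable from
a symbol of the example grammar and checks whether a given string is among those derivable
from the sentential symbol. Enumeration is only feasible because this toy grammar is
non-recursive and therefore describes a finite set of strings; the function names are
hypothetical.

```python
from itertools import product

GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["D", "N"]],
    "D": [["a"], ["the"]],
    "N": [["boat"], ["cat"], ["tree"]],
    "V": [["hit"], ["missed"]],
}

def expansions(symbol):
    """Return every word sequence derivable from symbol (finite for this grammar)."""
    if symbol not in GRAMMAR:  # terminal symbol, i.e., a word
        return [(symbol,)]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Combine every choice of expansion for each symbol on the right-hand side
        for parts in product(*(expansions(s) for s in rhs)):
            results.append(tuple(word for part in parts for word in part))
    return results

def accepts(sentence, sentential_symbol="S"):
    """True if the string forms an instance of the sentential symbol."""
    return tuple(sentence.split()) in set(expansions(sentential_symbol))

print(accepts("a cat hit the tree"))  # True
print(accepts("the tree"))            # False: a noun phrase, not a sentence
print(len(expansions("S")))           # 72
```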

A probabilistic context-free grammar (Solomonoff, 1959) is a context-free grammar that
not only describes a set of strings, but also assigns probabilities to these strings.² A
probability is associated with each rule in the grammar, such that the sum of the probabilities


1 The following abbreviations are used in this work: 

S = sentence 

VP = verb phrase 

NP = noun phrase 

D = determiner 

N = noun 

V = verb 

2 Actually, a probabilistic context-free grammar also assigns probabilities to strings not accepted by the 
grammar; this probability is just zero.

            S
          /   \
       NP       VP
      /  \     /  \
     D    N   V    NP
     |    |   |   /  \
     a   cat hit D    N
                 |    |
                the  tree

Figure 3.1: Parse tree for a cat hit the tree



of all rules expanding a given symbol is equal to one.³ This probability represents the
frequency with which the rule is applied to expand the symbol on its left-hand side. For
example, the following is a probabilistic context-free grammar:



S  -> NP VP     (1.0)
VP -> V NP      (1.0)
NP -> D N       (1.0)
D  -> a         (0.6)
D  -> the       (0.4)
N  -> boat      (0.5)
N  -> cat       (0.3)
N  -> tree      (0.2)
V  -> hit       (0.7)
V  -> missed    (0.3)



To explain how a probabilistic context-free grammar assigns probabilities to strings, we
first need to describe how such a grammar assigns probabilities to parse trees. A parse tree
of a string displays the grammar rules that are applied to form the sentential symbol from
the string. For example, a parse tree of a cat hit the tree is displayed in Figure 3.1. Each
non-leaf node in the tree represents the application of a grammar rule. For instance, the
top node represents the application of the S -> NP VP rule. The probability assigned to
a parse is simply the product of the probabilities associated with each rule in the parse.
The probability assigned to the parse in Figure 3.1 is 0.6 x 0.3 x 0.7 x 0.4 x 0.2 = 0.01008,
the terms corresponding to the rules D -> a, N -> cat, V -> hit, D -> the, and N -> tree,
respectively. All other rules used in the parse have probability 1. The probability assigned
to a string is the sum of the probabilities of all of its parses; it is possible for a sentence to
have more than a single parse.⁴
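
The product over rules can be written out as a short Python sketch. The representation of
parse trees as nested tuples and the function name are choices made here for illustration
only; the rule probabilities are those of the example grammar above.

```python
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 1.0,
    ("NP", ("D", "N")): 1.0,
    ("D", ("a",)): 0.6,
    ("D", ("the",)): 0.4,
    ("N", ("boat",)): 0.5,
    ("N", ("cat",)): 0.3,
    ("N", ("tree",)): 0.2,
    ("V", ("hit",)): 0.7,
    ("V", ("missed",)): 0.3,
}

def parse_probability(tree):
    """Product of rule probabilities in a parse tree.

    A tree is a tuple (label, child, ...), where each child is either
    a subtree or a word given as a string.
    """
    label, *children = tree
    # The right-hand side of the rule applied at this node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            prob *= parse_probability(child)
    return prob

# The parse of "a cat hit the tree" shown in Figure 3.1
tree = ("S",
        ("NP", ("D", "a"), ("N", "cat")),
        ("VP", ("V", "hit"), ("NP", ("D", "the"), ("N", "tree"))))
print(parse_probability(tree))  # approximately 0.01008
```

For a string with several parses, the string probability would be the sum of this quantity
over all of its parse trees.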
[S the [S_the dog [S_dog barks [S_barks]]]]

Figure 3.2: Parse of the dog barks using a bigram-equivalent grammar

3.1.2 Probabilistic Context-Free Grammars and n-Gram Models 

In this section, we discuss the relationship between probabilistic context-free grammars and
n-gram models. First, we note that n-gram models are actually instances of probabilistic
context-free grammars. For example, consider a bigram model with probabilities $p(w_i \mid w_{i-1})$.
This model can be expressed using a grammar with $|T| + 1$ nonterminal symbols, where $T$
is the set of all terminal symbols, i.e., the set of all words. We have the sentential symbol
$S$ and a nonterminal symbol $S_w$ for each word $w \in T$. A symbol $S_w$ can be interpreted as
representing the state of having the word $w$ immediately to the left. The grammar consists
of all rules

$$S \rightarrow w_i \, S_{w_i} \qquad (p(w_i \mid w_{bos}))$$
$$S_{w_{i-1}} \rightarrow w_i \, S_{w_i} \qquad (p(w_i \mid w_{i-1}))$$
$$S_{w_{i-1}} \rightarrow w_{eos} \qquad (p(w_{eos} \mid w_{i-1}))$$

for $w_{i-1}, w_i \in T$, where $w_{bos}$ and $w_{eos}$ are the beginning- and end-of-sentence tokens. The
values in parentheses are the probabilities associated with each rule expressed in terms of
the probabilities of the corresponding bigram model. This grammar assigns the identical
probabilities to strings as the original bigram model. For example, consider the sentence
the dog barks. The only parse of this sentence under the above grammar is displayed in
Figure 3.2. The probability of a parse is the product of the probabilities of each rule used,
and going from top to bottom we get

$$p(\text{the} \mid w_{bos}) \, p(\text{dog} \mid \text{the}) \, p(\text{barks} \mid \text{dog}) \, p(w_{eos} \mid \text{barks})$$

which is identical to the probability assigned by a bigram model.
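The equivalence can be checked numerically with a small Python sketch. The bigram probabilities below are hypothetical values chosen only for illustration, and the token strings "<bos>" and "<eos>" and all function names are likewise assumptions.

```python
from math import isclose, prod

BOS, EOS = "<bos>", "<eos>"

# Hypothetical bigram probabilities p(w_i | w_{i-1}), chosen only for illustration.
BIGRAM = {
    (BOS, "the"): 0.5,
    ("the", "dog"): 0.2,
    ("dog", "barks"): 0.1,
    ("barks", EOS): 0.4,
}


def bigram_probability(words):
    """Chain of bigram probabilities, including the end-of-sentence token."""
    tokens = [BOS, *words, EOS]
    return prod(BIGRAM[(prev, cur)] for prev, cur in zip(tokens, tokens[1:]))


def grammar_rules(bigram):
    """Rules of the bigram-equivalent grammar: S_prev -> w S_w and S_prev -> w_eos."""
    rules = {}
    for (prev, word), p in bigram.items():
        lhs = "S" if prev == BOS else f"S_{prev}"
        rhs = (word,) if word == EOS else (word, f"S_{word}")
        rules[(lhs, rhs)] = p
    return rules


def parse_probability(words, rules):
    """Multiply rule probabilities along the single right-branching parse."""
    lhs, p = "S", 1.0
    for word in words:
        p *= rules[(lhs, (word, f"S_{word}"))]
        lhs = f"S_{word}"
    return p * rules[(lhs, (EOS,))]


sentence = ["the", "dog", "barks"]
print(bigram_probability(sentence), parse_probability(sentence, grammar_rules(BIGRAM)))
```

Both values agree because each rule in the parse carries exactly one factor of the bigram chain.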

3 This assures (except for some pathological cases) that the probabilities assigned to strings sum to one.
4 A good introduction to probabilistic context-free grammars has been written by Jelinek et al. (1992).

Not only can probabilistic context-free grammars model the same local dependencies as
n-gram models, but they have the potential to model long-distance dependencies beyond
the scope of n-gram models. To demonstrate this, consider the sentence

John read the boy a story.

In a trigram model, the word story is assumed to depend only on the phrase boy a. However,
there is a strong dependence between the words read and story. We can model this using
the following grammar fragment:



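$$S_{read} \rightarrow NP^{subj}_{read} \; VP_{read}$$
$$VP_{read} \rightarrow V_{read} \; NP^{i\text{-}obj}_{read} \; NP^{d\text{-}obj}_{read}$$
$$V_{read} \rightarrow \text{read}$$
$$NP_{story} \rightarrow \ldots \; N_{story}$$
$$N_{story} \rightarrow \text{story}$$
$$NP^{d\text{-}obj}_{read} \rightarrow NP_{story}$$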

The symbols with subscript read are symbols that we restrict to only occur in sentences with 
the main verb read. The symbols with subscript story are symbols that we restrict to only 
occur in noun phrases whose head word is story. The superscripts on the NP's represent 
the different roles a noun phrase can play in a sentence. The probability associated with the 
last rule represents the probability that the word story is the head of the direct object of 
the verb read; this captures the long-distance dependency present between these two words. 
Probabilistic context-free grammars can express dependencies between words arbitrarily far 
apart. 

Thus, we see that probabilistic context-free grammars are a more powerful formalism
than n-gram models, and thus have the potential for superior performance. Furthermore,
grammars also have the potential to be more compact than n-gram models while achieving
equivalent performance, because grammars can express classing, or the grouping together
of similar words. For example, consider the words corporal and sergeant. These words have
very similar bigram behaviors: for most words $w$ we have $p(w \mid \text{corporal}) \approx p(w \mid \text{sergeant})$ and
$p(\text{corporal} \mid w) \approx p(\text{sergeant} \mid w)$. However, in bigram models the probabilities associated with
these two words are estimated completely independently. In a grammatical representation,
it is possible instead to introduce a symbol, say A, that corresponds to both words, i.e., to
have

A -> sergeant | corporal.

We can then have a single set of bigram probabilities $p(w \mid A)$ and $p(A \mid w)$ for the symbol
A, instead of a separate set for each word. Notice that this does not preclude having some
bigram probabilities specific to either corporal or sergeant, in cases where their behavior differs.
Because grammars can class together similar words, equivalent performance to n-gram
models can be achieved with much smaller models (Brown et al., 1992b).

Notice that when we use the term grammar, we are talking of its formal meaning, 
i.e., a collection of rules that describe how to build sentences from words. This contrasts 
with the connotation of the term grammar in linguistics, of a representation that describes 
linguistically meaningful concepts such as noun phrases and verb phrases. The symbols 
in the grammars we consider do not generally have any relation to linguistic constituents. 
Thus, the grammars we consider would not be applicable to the task of parsing for natural 
language processing, where grammars are used to build structure useful for determining 
sentence meaning.^5



There have been attempts to use linguistic grammars for language modeling (Newell,
1973; Woods et al., 1976). However, such attempts have been unsuccessful. These
manually-designed grammars cannot approach the coverage achieved by algorithms that statistically
analyze millions of words of text. Furthermore, linguistic grammars are geared toward 
compactly describing language; in language modeling the goal is to describe language in a 
probabilistically accurate manner. Models with large numbers of parameters like n-gram 
models are better suited to this task. 

In this work, our goal was to design an algorithm that induces grammars somewhere 
between the rich grammars of linguistics and the flat grammars corresponding to n-gram 
models, grammars that have the structure for modeling long-distance dependencies as well 
as the size for modeling specific n-gram-like dependencies. In addition, we desired the 
grammars to still be significantly more compact than comparable n-gram models. 

We produced a grammar induction algorithm that largely satisfied these goals. In ex- 
periments, it significantly outperforms the most widely-used grammar induction algorithm, 
the Lari and Young algorithm, and on artificially-generated corpora it outperforms n-gram 
models. However, on naturally occurring data n-gram models are still superior. The algo- 
rithm induces a probabilistic context-free grammar through a greedy heuristic search within 
a Bayesian framework, and it refines this grammar with a post-pass using the Inside-Outside 
algorithm. The algorithm does not require the training data to be manually annotated in 
any way.^6

3.2 Grammar Induction as Search 

Grammar induction can be framed as a search problem, and has been framed as such
almost without exception in past research (Angluin and Smith, 1983). The search space is
taken to be some class of grammars; for example, in our work we search within the space of
probabilistic context-free grammars. We search for a grammar that optimizes some quantity, 
referred to as the objective function. In grammar induction, the objective function generally 
contains a factor that reflects how accurately the grammar models the training data. 

Most work in language modeling, including n-gram models and the Inside-Outside al- 
gorithm, falls under the maximum likelihood paradigm. In this paradigm, the objective 
function is taken to be the likelihood or probability of the training data given the grammar. 



5 Probabilistic context-free grammars have been argued to be inappropriate for modeling natural language
because they cannot model lexical dependencies as do n-gram models (Resnik, 1992; Schabes, 1992). As we
have shown that n-gram models are instances of probabilistic context-free grammars, this is obviously not
strictly true. A more accurate statement is that the context-free grammars traditionally used for parsing
are not appropriate for language modeling. These grammars typically have a small set of nonterminal
symbols, e.g., {S, NP, VP, ...}, and grammars with few nonterminal symbols cannot express many lexical
dependencies. However, with expanded symbol sets it is possible to express these dependencies, e.g., as in
the example given earlier in this section where we qualify nonterminal symbols with their head words.

6 Some grammar induction algorithms require that the training data be annotated with parse tree
information (Pereira and Schabes, 1992; Magerman, 1994). However, these algorithms tend to be geared toward
parsing instead of language modeling. It is expensive to manually annotate data, and it is not practical to
annotate the amount of data typically used in language modeling.
That is, we try to find the grammar

$$G = \arg\max_G \, p(O \mid G)$$

where $O$ denotes the training data or observations. The probability of the training data is
the product of the probability of each sentence in the training data, i.e.,

$$p(O \mid G) = \prod_{i=1}^{n} p(o_i \mid G)$$

if the training data $O$ is composed of the sentences $\{o_1, \ldots, o_n\}$. The probability of a
sentence $p(o_i \mid G)$ is straightforward to calculate for a probabilistic grammar $G$.

However, the optimal grammar under this objective function is one that generates only
sentences in the training data and no other sentences. In particular, the optimal grammar
consists exactly of all rules of the form $S \rightarrow o_i$, each such rule having probability $c(o_i)/n$,
where $c(o_i)$ is the number of times the sentence $o_i$ occurs in the training data. Obviously,
this grammar is a poor model of language at large even though it assigns a high probability
to the training data; this phenomenon is called overfitting the training data.
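The overfitting grammar can be made concrete with a small Python sketch. The training sentences and the function names are hypothetical, and the grammar is stored simply as a map from each sentence to the probability of its rule.

```python
from collections import Counter
from math import prod


def ml_grammar(training):
    """Maximum likelihood grammar: one rule S -> o_i per distinct sentence,
    with probability c(o_i) / n."""
    counts = Counter(training)
    n = len(training)
    return {sentence: count / n for sentence, count in counts.items()}


def sentence_probability(grammar, sentence):
    """A sentence with no rule S -> sentence cannot be generated: probability 0."""
    return grammar.get(sentence, 0.0)


def data_likelihood(grammar, data):
    """p(O | G) as the product of the sentence probabilities."""
    return prod(sentence_probability(grammar, s) for s in data)


# Hypothetical training data, for illustration only.
TRAINING = [
    "a cat hit the tree",
    "a cat hit the tree",
    "the cat missed a boat",
    "a boat hit the tree",
]
GRAMMAR = ml_grammar(TRAINING)

print(data_likelihood(GRAMMAR, TRAINING))
print(sentence_probability(GRAMMAR, "the cat hit a boat"))
```

The unseen sentence receives probability zero even though it is built entirely from words and patterns present in the training data.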

In n-gram models and work with the Inside-Outside algorithm (Lari and Young, 1990;
Lari and Young, 1991; Pereira and Schabes, 1992), this issue is evaded because all of the
models considered are of a fixed size, so that the "optimal" grammar cannot be expressed.^7
However, in our work we do not wish to limit the size of the grammars considered.

We can address this issue elegantly by using a Bayesian framework instead of a maximum
likelihood framework. As touched on earlier in this thesis, in the Bayesian framework one attempts
to find the grammar $G$ with highest probability given the data, $p(G \mid O)$, as opposed to the
grammar that yields the highest probability of the data, $p(O \mid G)$, as in maximum likelihood.
Intuitively, finding the most probable grammar is more correct than finding the grammar
that maximizes the probability of the data.

Looking at the mathematics, in the Bayesian framework we try to find

$$G = \arg\max_G \, p(G \mid O).$$

As it is unclear how to estimate $p(G \mid O)$ directly, we apply Bayes' Rule and get

$$G = \arg\max_G \frac{p(O \mid G) \, p(G)}{p(O)} = \arg\max_G \, p(O \mid G) \, p(G) \qquad (3.1)$$

where $p(G)$ denotes the prior probability of a grammar $G$. The prior probability $p(G)$ is
supposed to reflect our a priori notion of how frequently the grammar $G$ appears in the
given domain.

7 As seen in an earlier chapter, even though n-gram models cannot express the optimal grammar, there is still a
grave overfitting problem, which is addressed through smoothing.

Notice that the Bayesian framework is equivalent to the maximum likelihood framework
if we take $p(G)$ to be a uniform distribution. However, it is mathematically improper to
have a uniform distribution over a countably infinite set, such as the set of all context-free
grammars. We give an informal argument describing its mathematical impossibility and
relate this to why the maximum likelihood approach tends to overfit training data.

Consider selecting a context-free grammar randomly using a uniform distribution over 
all context-free grammars. Now, let us define a size for each grammar; for example, we 
can take the number of characters in the textual description of a grammar to be its size. 
Then, notice that for any value k, there is a zero probability of choosing a grammar of size 
less than k, since there are an infinite number of grammars of size larger than k but only 
a finite number of grammars smaller than k. Hence, in some sense the "average" grammar 
according to the uniform distribution is infinite in size, and this relates to why a uniform 
distribution is mathematically improper. In addition, this is related to why the maximum 
likelihood approach prefers overlarge, overfitting grammars, as the uniform prior assigns far 
too much probability to large grammars. 

Instead, we argue that taking a minimum description length (MDL) principle (Rissanen,
1978) prior is desirable. The minimum description length principle states that one should
select a grammar $G$ that minimizes the sum of $l(G)$, the length of the description of the
grammar, and $l(O \mid G)$, the length of the description of the data given the grammar. We will
later give a detailed description of what these lengths mean; for now, suffice it to say that
this corresponds to taking a prior of the form

$$p(G) = 2^{-l(G)}$$

where $l(G)$ is the length of the grammar $G$ in bits. For example, we can take $l(G)$ to be
the length of a textual description of the grammar.

Intuitively, this prior is appealing because it captures the intuition behind Occam's 
Razor, that simpler (or smaller) grammars are preferable over complex (or larger) grammars. 
Clearly, the prior p(G) assigns higher probabilities to smaller grammars. However, this prior 
extends Occam's Razor by providing a concrete way to trade off the size of a grammar with 
how accurately the grammar models the data. In particular, we try to find the grammar 
that maximizes $p(G) \, p(O \mid G)$. The term $p(G)$ favors grammars that are small, and the term
$p(O \mid G)$ favors grammars that model the training data well.

This preference for small grammars over large addresses the problem of overfitting. 
The optimal grammar under the maximum likelihood paradigm will be given a poor score 
because its prior probability p(G) will be very small, given its vast size. Instead, the optimal 
grammar under MDL will be a compromise between size and modeling accuracy. 

Because coding theory plays a key role in the future discussion, we digress at this point to 
introduce some basic concepts in the field. Coding theory forms the basis of the descriptions 
used in the minimum description length principle, and it can be used to tie together the 
MDL principle with the Bayesian framework. 

3.2.1 Coding 

Coding can be thought of as the study of storing information compactly. In particular, we
are interested in representing information just using a binary alphabet, 0's and 1's, as is
necessary for storing information in a computer. Coding just describes ways of mapping
information into strings of binary digits or bits in a one-to-one manner, so that the original
information can be reconstructed from the associated bit string.

For example, consider the task of coding the outcome of a coin flip. A sensible code 
is to map tails to the bit string 0, and to map heads to the bit string 1 (or vice versa). 
Another possibility is to map both heads and tails to 0. This is an invalid code, because it is 
impossible to reconstruct whether a coin flip was heads or tails from the yielded bit string. 
In general, distinct outcomes must be mapped to distinct bit strings; that is, mappings need 
to be one-to-one. Another possible code is to map heads to the bit string 00, and to map 
tails to the bit string 11. This is a valid code, but it is inefficient as it codes the information 
using two bits when one will do. 

Codes can be used to store arbitrarily complex information. For example, the ASCII 
convention maps letters of the alphabet to eight-bit strings. In this convention, the text 
hi would be mapped to the sixteen-bit string 0110100001101001. By using the ASCII 
convention, any data that can be expressed through text can be mapped to bit strings. 
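The ASCII example can be verified with a few lines of Python. The function names to_bits and from_bits are hypothetical; the sketch assumes the text contains only ASCII characters and pads each character code to eight bits.

```python
def to_bits(text):
    """Code text as a bit string, eight bits per ASCII character."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


def from_bits(bits):
    """Invert to_bits by reading the bit string eight bits at a time."""
    chunks = (bits[i:i + 8] for i in range(0, len(bits), 8))
    return bytes(int(chunk, 2) for chunk in chunks).decode("ascii")


print(to_bits("hi"))
print(from_bits(to_bits("hi")))
```

Because the mapping is one-to-one, decoding recovers the original text exactly.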

For obvious reasons, coding theory is concerned with finding ways to code information 
with as few bits as possible. For example, coding theory is at the core of the field of data 
compression. We now describe how to code data optimally, moving from simple examples 
to more complex ones. 

Fixed-Length Coding 

First, consider the case of coding the outcome of a single coin flip, which we showed earlier 
can be coded using a single bit. There are two possible outcomes to a coin flip, and there 
are two possible values for a single bit, so it is possible to find a one-to-one mapping from 
outcomes to bit values. Now, consider coding the outcome of $k$ coin flips. Intuitively, this
should be codable using $k$ bits, and in fact it can. There are $2^k$ possible outcomes to $k$ coin
flips, and there are $2^k$ possible values of a $k$-bit string, so again we can find a one-to-one
mapping. In general, to code any information that has exactly $2^k$ possible values, we need
at most $k$ bits. Alternatively, we can phrase this as: to code information with $n$ possible
values, we need at most $\lceil \log_2 n \rceil$ bits. For example, in New York Lotto, which involves
picking 6 distinct numbers from the values 1 through 48, there are $\binom{48}{6} = 12{,}271{,}512$
possible combinations. Then, we need at most $\lceil \log_2 12{,}271{,}512 \rceil = 24$ bits to code a New
York Lotto ticket. Notice that in this discussion we ignore how difficult it is to construct
the coder and decoder; to write a program that maps Lotto tickets to and from distinct
24-bit strings is not trivial. It is usually possible to find inefficient codes that are much easier
to code and decode. For example, we could just store a Lotto ticket as text using the ASCII
convention.
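The Lotto figures can be checked with Python's standard library. This is a minimal sketch: fixed_length_bits is a hypothetical helper that computes the ceiling of the base-2 logarithm with exact integer arithmetic, and the sample ticket text is made up for the comparison with ASCII storage.

```python
from math import comb


def fixed_length_bits(n_values):
    """Smallest b with 2**b >= n_values, i.e. ceil(log2(n_values)), computed exactly."""
    return (n_values - 1).bit_length()


LOTTO_TICKETS = comb(48, 6)          # ways to pick 6 distinct numbers from 1..48
LOTTO_BITS = fixed_length_bits(LOTTO_TICKETS)

# Hypothetical ticket stored as text, eight bits per character.
SAMPLE_TICKET = "3 11 17 25 38 46"
ASCII_BITS = 8 * len(SAMPLE_TICKET)

print(LOTTO_TICKETS, LOTTO_BITS, ASCII_BITS)
```

The text encoding is far easier to produce and parse, but for this sample ticket it spends 128 bits where 24 suffice.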

Variable-Length Coding 

While the preceding analysis is optimal if we require that all of the bit strings mapped to are
of the same length, in most cases one can do better on average if outcomes can be mapped
to bit strings of different lengths. In particular, if certain outcomes are more frequent than
others, these should be mapped to shorter bit strings. While this may cause infrequent
outcomes to be mapped to longer bit strings than in a fixed-length coding, this is more than
made up by the savings from shorter bit strings since the shorter strings correspond to more
frequent outcomes.^8

For example, let us consider the coding of the information of which of three consecutive 
coin flips, if any, is the first one to be heads. There are four possible outcomes: the first 
flip, the second flip, the third flip, or none of them. Thus, we can code the outcome using 
two bits with a fixed-length code. Now, let us consider a different coding, where we just use 
three bits to code the outcome of each of the three coin flips in order, using 0 to mean tails
and 1 heads. This is a valid coding, since we can still recover which of the flips yields the 
first head, but obviously this coding is less efficient than the previous one because it uses 
three bits instead of two. However, notice that in some cases some of the bits in this coding 
are superfluous. For example, if the first bit is 1, then we know the earliest flip to be heads 
is the first one regardless of the later flips, so there is no need to include the last two bits. 
Likewise, if the first bit is 0 and the second bit is 1, we do not need to include the last bit
because we know the earliest flip to be heads is the second one. Instead of a fixed-length 
code, we can assign the bit strings 1, 01, 001, and 000 to the four outcomes. Notice that 
the probabilities of these four outcomes are 0.5, 0.25, 0.125, and 0.125, if the coin is fair. 
Thus, on average we expect to use $0.5 \times 1 + 0.25 \times 2 + 0.125 \times 3 + 0.125 \times 3 = 1.75$ bits
to code an outcome, taking into account the relative frequencies of each outcome. This is
superior to the average of two bits yielded by the optimal fixed-length code.^9
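The comparison between the two codes can be reproduced exactly with rational arithmetic. In this sketch the outcome labels (1, 2, 3, and None for "no heads") and the particular two-bit codewords in FIXED_CODE are assumptions; the variable-length codewords and probabilities are those of the example.

```python
from fractions import Fraction

# Outcomes: index of the first head among three fair flips, or None if all are tails.
PROBABILITIES = {
    1: Fraction(1, 2),
    2: Fraction(1, 4),
    3: Fraction(1, 8),
    None: Fraction(1, 8),
}
VARIABLE_CODE = {1: "1", 2: "01", 3: "001", None: "000"}
FIXED_CODE = {1: "00", 2: "01", 3: "10", None: "11"}  # one possible assignment


def expected_length(code, probabilities):
    """Average codeword length, weighting each outcome by its probability."""
    return sum(probabilities[o] * len(code[o]) for o in code)


def is_prefix_free(code):
    """True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)


print(expected_length(VARIABLE_CODE, PROBABILITIES))  # 7/4
print(expected_length(FIXED_CODE, PROBABILITIES))     # 2
print(is_prefix_free(VARIABLE_CODE))                  # True
```

The prefix check reflects the requirement that consecutive codewords remain unambiguous when several trials are coded one after another.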

In general, if each outcome has a probability of the form $2^{-k}$ for $k$ integer, then it
is provably optimal to assign an outcome with probability $2^{-k}$ a bit string of length $k$.
Alternatively phrased: given that the probabilities of all outcomes are of the form $2^{-k}$, $k \in \mathbb{N}$,
an outcome with probability $p$ is optimally coded using $\log_2 \frac{1}{p}$ bits. Thus, the code
described in the last example is optimal, as $\log_2 \frac{1}{0.5} = 1$, $\log_2 \frac{1}{0.25} = 2$, and $\log_2 \frac{1}{0.125} = 3$.
Notice that in the case there are $2^k$ equiprobable outcomes, this formula just comes out to
a fixed-length code of length $k$.

8 We see this principle followed in language: most common words have short spellings, and long
expressions that are used frequently in some context are often abbreviated.

9 One may ask why we could not use an even shorter code, e.g., the bit strings 0, 1, 00, and 01. The reason
is that it is required that codes are unambiguous even when used to code multiple trials consecutively. For
example, if we coded two of the above trials consecutively with this new code and yielded the bit string 000,
we would not be able to tell whether this should be interpreted as (0)(00) or (00)(0). One way to assure
unambiguity is to require that no codeword is a prefix of another, as is satisfied by the original code given
in the example.

Now, consider the case of coding probabilities that are not negative powers of two. For
example, let us code the outcome of a single toss of a fair 3-sided die. Intuitively, we
want to assign codeword lengths that are appropriate for probabilities that are negative
powers of two that are near to the actual probabilities of each outcome. In fact, there is
an algorithm for performing this assignment in an optimal way, namely Huffman coding
(Huffman, 1952). In this case, Huffman coding yields the codewords 0, 10, and 11. Clearly,
the codeword lengths do not follow the relation that an outcome with probability $p$ has
codeword length $\log_2 \frac{1}{p}$, as in this case these values are not integers.^10
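For readers who want to see the construction, the following is a minimal Python sketch of binary Huffman coding using a priority queue. The function name, the dictionary-based representation of partial codes, and the tie-breaking counter are implementation choices made here, not details taken from the cited work.

```python
import heapq
from fractions import Fraction
from itertools import count


def huffman_code(probabilities):
    """Build a binary Huffman code; returns {outcome: codeword}."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {outcome: ""}) for outcome, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing their codewords with 0 and 1.
        p0, _, code0 = heapq.heappop(heap)
        p1, _, code1 = heapq.heappop(heap)
        merged = {o: "0" + w for o, w in code0.items()}
        merged.update({o: "1" + w for o, w in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


DIE = {face: Fraction(1, 3) for face in (1, 2, 3)}
DIE_CODE = huffman_code(DIE)
AVERAGE_LENGTH = sum(DIE[face] * len(word) for face, word in DIE_CODE.items())

print(sorted(DIE_CODE.values()), AVERAGE_LENGTH)
```

Which face receives the one-bit codeword depends on tie-breaking, but the multiset of codeword lengths (one, two, two) does not.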

However, consider the case where instead of coding a single toss, we are coding $k$ tosses
of a fair 3-sided die. Notice that there are $3^k$ possible outcomes, and as mentioned earlier
we can code this using $\lceil \log_2 3^k \rceil$ bits using a fixed-length code. Hence, on average each
toss requires $\frac{\lceil \log_2 3^k \rceil}{k} = \frac{\lceil k \log_2 3 \rceil}{k}$ bits. As $k$ grows large, this approaches the value $\log_2 3$. By
coding multiple outcomes jointly, we approach the limit where each individual outcome (of
probability $p = \frac{1}{3}$) can be coded on average using $\log_2 \frac{1}{p} = \log_2 \frac{1}{1/3} = \log_2 3$ bits; this is the
same relation we found when all probabilities were negative powers of two.
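The convergence toward $\log_2 3$ is easy to observe numerically. In this sketch, bits_per_toss is a hypothetical helper; it uses exact integer arithmetic for the ceiling so that large values of k do not suffer from floating-point rounding.

```python
from math import log2


def bits_per_toss(k, sides=3):
    """Average bits per toss when k tosses are coded jointly with a fixed-length code."""
    total_bits = (sides ** k - 1).bit_length()  # ceil(log2(sides ** k)), computed exactly
    return total_bits / k


for k in (1, 5, 50, 1000):
    print(k, bits_per_toss(k))
print("limit", log2(3))
```

A single toss costs two bits, while coding a thousand tosses jointly brings the per-toss cost to within a small fraction of a bit of the limit.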

This result extends to the case where not all outcomes are equiprobable; instead of
fixed-length coding for the joint trials, we can use Huffman coding of the joint trials to approach
this limit. In general, if a particular outcome has probability $p$, in the limit of coding a
large number of trials, each of those outcomes will take on average $\log_2 \frac{1}{p}$ bits to code in the
optimal coding (Shannon, 1948; Cover and King, 1978). In fact, this limit can be realized in
practice with an efficient algorithm called arithmetic coding (Pasco, 1976; Rissanen, 1976).



3.2.2 Description Lengths 

Now, let us return to the minimum description length principle and the meaning of a 
description. Recall that MDL states that one should minimize the sum of $l(G)$, the length
of the description of the grammar, and $l(O \mid G)$, the length of the description of the data
given the grammar. A description simply refers to the bit string that is used to code the 
given information. Notice that we do not care what the actual bit string that composes a 
description is; we are only concerned with its length. 

First, let us consider $l(O \mid G)$. Typically, $G$ is a probabilistic grammar that assigns
probabilities to sentences $p(o_i \mid G)$, and we can calculate the probability of the training
data as $p(O \mid G) = \prod_{i=1}^{n} p(o_i \mid G)$, where the training data $O$ is composed of the sentences
$\{o_1, \ldots, o_n\}$. Then, using the result that an outcome with probability $p$ can be coded
optimally with $\log_2 \frac{1}{p}$ bits, we get that taking $l(O \mid G)$ to be $\log_2 \frac{1}{p(O \mid G)}$ should yield the
lowest lengths on average.
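Since the logarithm of a product is a sum of logarithms, this length can be accumulated sentence by sentence. The sketch below assumes the per-sentence probabilities have already been computed by some grammar; the function name and the sample values are hypothetical.

```python
from math import log2, prod


def data_description_length(sentence_probabilities):
    """l(O|G) = log2(1 / p(O|G)) = sum over sentences of -log2 p(o_i | G), in bits."""
    return sum(-log2(p) for p in sentence_probabilities)


# Hypothetical values of p(o_i | G) for a three-sentence training set.
PROBS = [0.5, 0.25, 0.125]

print(data_description_length(PROBS))   # 6.0
print(log2(1 / prod(PROBS)))            # 6.0
```

Summing logarithms also avoids the numerical underflow that the raw product would suffer on large training sets.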

10 However, Huffman coding does guarantee that on average outcomes are assigned codewords at most one
bit longer than what is dictated by the $\log_2 \frac{1}{p}$ relation. To see how this bound can be achieved simply, we
can just round down each probability to the next lower negative power of two, and assign codeword lengths
as described earlier.

Notice that for the MDL principle to be meaningful, we need to use an optimal coding
as opposed to some arbitrary inefficient coding. There are many descriptions of a given
piece of data. For example, for any description of some data given a grammar, we can
create additional descriptions of the same data by just padding the end of the original
description with 0's. Clearly, it is easy to make descriptions arbitrarily long. However, it is
not possible to make descriptions arbitrarily compact. There is a lower bound to the length
of the description of any piece of data (Solomonoff, 1960; Solomonoff, 1964; Kolmogorov,
1965), and we can use this lower bound to define a meaningful description length for a piece
of data. This is why we choose an optimal coding for calculating $l(O \mid G)$. This dictum
of optimal coding extends as well to calculating $l(G)$, the length of the description of a
grammar.

Thus, it is not appropriate to use textual descriptions of grammars as mentioned in Sec- 
tion 3.2, as this is rather inefficient. For example, consider the following textual description 
of a grammar segment: 



NP->D N
D->a|the
N->boat|cat|tree



This textual description is 34 characters long (including carriage returns), which translates 
to 272 bits under the ASCII convention of eight bits per character. We can achieve a 
significantly smaller description using a more complex encoding, where the grammar is 
coded in three distinct sections: 

• We code the list of terminal symbols as text: 
a the boat cat tree 

which comes to 20 characters including the carriage return. 

• We code the number of nonterminal symbols and the number of grammar rules also 
as text: 



3 3 



which comes to 4 characters including the carriage return. Notice that the names of 
nonterminal symbols are not relevant in describing a grammar; these symbols can be 
renamed arbitrarily without affecting what strings the grammar generates. 

• Finally, we code the list of grammar rules, where each grammar rule is coded in several 
parts: 

— The nonterminal symbol on the left-hand side of a rule can be coded using two 
bits, as there are a total of three nonterminal symbols. 

— To code whether a rule is of the form $A \rightarrow a_1 a_2 \cdots$ or of the form $A \rightarrow a_1 | a_2 | \cdots$,
we use a single bit. (In this example, we do not consider rules combining both
forms.)

— To code how many symbols are on the right-hand side of a rule, we use three 
bits. With three bits we can code up to a length of eight; if a rule is longer it 
can be split into multiple rules. 

— To code each symbol on the right-hand side of a rule, we use three bits to code 
which of the eight possible symbols it is (three nonterminal, five terminal). 

Under this coding, the first two rules each take 12 bits, and the third takes 15 bits. 
This comes to a total of 24 characters for the first two sections, or 192 bits under the
ASCII convention, and 39 bits for the last section, yielding a total of 231 bits. This is
significantly less than the 272 bits of a naive textual encoding. Using more advanced
techniques that will be described later, grammar descriptions can be made even more compact.
In addition, the grammars we will be using later will be probabilistic, so we will also have
to code probability values.
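The bit counts for the two encodings can be tallied with a short Python sketch. It assumes that the rule lines of the textual description contain no spaces around the "|" separators, which is what the stated length of 34 characters implies, and that every line ends with a single carriage-return character; the names used are hypothetical.

```python
# Naive textual description: three rule lines, each ending in a carriage return.
NAIVE_TEXT = "NP->D N\nD->a|the\nN->boat|cat|tree\n"

# Sections one and two of the structured encoding, stored as text.
TERMINALS_TEXT = "a the boat cat tree\n"
COUNTS_TEXT = "3 3\n"

# Number of right-hand-side symbols in each of the three rules.
RHS_LENGTHS = [2, 2, 3]


def text_bits(text):
    """Eight bits per character under the ASCII convention."""
    return 8 * len(text)


def rule_bits(rhs_length, lhs_bits=2, form_bits=1, length_bits=3, symbol_bits=3):
    """Bits for one rule: left-hand side, rule form, length, then each symbol."""
    return lhs_bits + form_bits + length_bits + symbol_bits * rhs_length


naive_bits = text_bits(NAIVE_TEXT)
structured_bits = (
    text_bits(TERMINALS_TEXT)
    + text_bits(COUNTS_TEXT)
    + sum(rule_bits(n) for n in RHS_LENGTHS)
)

print(naive_bits, structured_bits)
```

Most of the remaining cost lies in the terminal list, which is still stored as plain text.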

Just as we used $p(O \mid G)$ to calculate $l(O \mid G)$, we can use the prior probability $p(G)$
mentioned in equation (3.1) to give us insight into $l(G)$. According to coding theory, to
calculate the optimal length $l(G)$ of a grammar $G$ we need to know the probability of
the grammar $p(G)$. An alternative approach to explicitly designing encodings like above
is instead to design a prior probability $p(G)$ and to define an encoding such that
$l(G) = \log_2 \frac{1}{p(G)}$, just as we did for $l(O \mid G)$. However, unlike $p(O \mid G)$ the distribution $p(G)$ is not
straightforward to estimate. Furthermore, it is important to note that in order for a coding
to be optimal (i.e., produce the shortest descriptions on average), the underlying probability
distribution must be accurate. For some distribution $p(G)$, we know that using $\log_2 \frac{1}{p(G)}$
bits to code a grammar $G$ is optimal only if $p(G)$ is the correct underlying distribution on
grammars.



For instance, consider the example given in Section 3.2.1 of coding which of three
consecutive coin flips is the first to turn up heads. A fixed-length code requires two bits
to code this, and we showed that by assigning the codewords 1, 01, 001, and 000 to
the outcomes: first flip, second flip, third flip, and no flip, respectively, we can achieve
an improved average of 1.75 bits, assuming the coin is fair. However, consider a biased
coin whose probability of heads is $\frac{1}{4}$. Then, the frequencies of the four outcomes become
1/4, 3/16, 9/64, and 27/64, respectively, and this yields an average codeword length of
$(1/4 \times 1) + (3/16 \times 2) + (9/64 \times 3) + (27/64 \times 3) = 2.3125$ bits, which is significantly worse
than the fixed-length code. Thus, we see that for a coding to be efficient we must have an
accurate model of the data.
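The effect of a mismatched model can be reproduced exactly with rational arithmetic. In this sketch the function names are hypothetical; the codeword lengths are those of the code 1, 01, 001, 000 designed for a fair coin.

```python
from fractions import Fraction

CODE_LENGTHS = [1, 2, 3, 3]  # lengths of the codewords 1, 01, 001, 000


def first_head_distribution(p_heads):
    """Probabilities that the first head is flip 1, flip 2, flip 3, or never occurs."""
    q = 1 - p_heads
    return [p_heads, q * p_heads, q * q * p_heads, q ** 3]


def expected_length(p_heads):
    """Average codeword length when the coin actually has the given bias."""
    return sum(p * n for p, n in zip(first_head_distribution(p_heads), CODE_LENGTHS))


print(expected_length(Fraction(1, 2)))  # 7/4, i.e. 1.75 bits
print(expected_length(Fraction(1, 4)))  # 37/16, i.e. 2.3125 bits
```

With the biased coin the most probable outcome, no heads at all, is paired with one of the longest codewords, which is why the average exceeds two bits.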

Applying this observation to coding grammars, we see that deriving the lengths $l(G)$ of
grammars from a prior $p(G)$ is no better than estimating $l(G)$ directly; we have no guarantee
that the prior $p(G)$ we choose is at all accurate. However, this relationship does provide
us with another perspective with which to view grammar encodings. For every grammar
encoding describing grammar lengths $l(G)$ there is an associated prior $p(G) = 2^{-l(G)}$, and
we should choose encodings that lead to priors $p(G)$ that are good models of grammar
frequency. For example, for grammars $G$ we perceive to be typical, i.e., to have high
probability $p(G)$, we want $l(G)$ to be low. In other words, we want typical grammars to
have short descriptions. Hence, referring to the two grammar encodings given earlier, as
the latter grammar encoding assigns shorter descriptions to typical grammars than the
naive encoding,^11 we conclude that in some sense the latter encoding corresponds to a more
accurate prior probability on grammars.

11 Actually, this is not clear. For smaller grammars, the complex encoding should be more efficient since,
for example, it can code symbol identities using a small number of bits while in a text representation a symbol
is represented using a minimum of one character, or eight bits. For large grammars, text encodings may
be more efficient since they can express variable-length encodings of symbol identities, while the complex
encoding assumes fixed-length encodings of symbol identities.



3.2.3 The Minimum Description Length Principle 

As touched on in the last section, the observation that an object with probability p should
be coded using log_2 (1/p) bits gives us a way to equate probabilities and description lengths, and
this is the key in showing the relation between the minimum description length principle
and the Bayesian framework. Under the Bayesian framework, we want to find the grammar

\[ G = \arg\max_G \; p(O \mid G)\, p(G) . \]

Under the minimum description length principle, we want to find the grammar

\[ G = \arg\min_G \; [\, l(O \mid G) + l(G) \,] . \]

Then, we get that

\[
\begin{aligned}
G &= \arg\max_G \; p(O \mid G)\, p(G) \\
  &= \arg\min_G \; [\, -\log_2 p(O \mid G)\, p(G) \,] \\
  &= \arg\min_G \left[ \log_2 \frac{1}{p(O \mid G)} + \log_2 \frac{1}{p(G)} \right] \\
  &= \arg\min_G \; [\, l_p(O \mid G) + l_p(G) \,]
\end{aligned}
\]

where l_p(α) denotes the length of α under the optimal coding given p. Thus, any problem
framed in the Bayesian framework can be converted to an equivalent problem under MDL, by
just taking the description lengths to be the optimal ones dictated by the given probabilities.
Likewise, any problem framed under MDL can be converted to an equivalent one in the
Bayesian framework, by choosing the probability distributions that would yield the given
description lengths. For example, it is easy to see that the Bayesian prior corresponding to
the MDL principle is p(G) = 2^{-l(G)}, as touched on earlier.
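As a numerical check of this equivalence, the following minimal Python sketch compares the two criteria on a hypothetical table of candidate grammars. The names G1 to G3 and all probability values are assumptions made up for illustration; they do not come from this work.

```python
import math

# Hypothetical candidates: name -> (likelihood p(O|G), prior p(G)).
# All values are illustrative assumptions only.
CANDIDATES = {
    "G1": (1e-12, 0.5),
    "G2": (1e-9, 0.01),
    "G3": (1e-10, 0.2),
}


def optimal_length(p):
    """Length in bits of the optimal code for an object of probability p."""
    return math.log2(1.0 / p)


def bayes_choice(candidates):
    """Return the grammar maximizing p(O|G) p(G)."""
    return max(candidates, key=lambda g: candidates[g][0] * candidates[g][1])


def mdl_choice(candidates):
    """Return the grammar minimizing l_p(O|G) + l_p(G)."""
    return min(
        candidates,
        key=lambda g: optimal_length(candidates[g][0])
        + optimal_length(candidates[g][1]),
    )
```

Under these assumed values both functions select the same grammar, since minimizing the sum of optimal code lengths is the same as maximizing the product of the corresponding probabilities.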

Thus, from a mathematical point of view, the minimum description length principle 
does not give us anything above the Bayesian framework. However, from a paradigmatic 
perspective, MDL provides two important ideas. 

Firstly, MDL gives us a new perspective for creating prior distributions on grammars. By
noticing that any grammar encoding scheme implicitly describes a probability distribution
p(G) = 2^{-l(G)}, we can create priors by just designing encoding schemes. For example,
both of the grammar encoding schemes used in Section 3.2.2 lead to prior distributions
rather different from those usually found in probability theory. Viewing prior distributions
in terms of encodings extends the toolbox one has for designing prior distributions. In
addition, one can mix and match conventional prior distributions from probability theory
with those stemming from an encoding perspective.

Secondly, it has been observed that "MDL-style" priors of the form p(G) = 2^{-l(G)} can be
good models of the real world (Solomonoff, 1964; Rissanen, 1978; Li and Vitanyi, 1993).[12]



To demonstrate this, let us consider some examples of real-world data. Let us say you see
one hundred flips of a coin, and each time it turns up heads. Clearly, you expect a head with
very high probability on the next toss.[13] Or, let us say you peek at someone's computer
terminal and see the following numbers output: 2, 3, 5, 7, ... , 83, 89. Then, you expect 
the next number to be output to be 97 with very high probability. Or, let us say you look 
at some text and notice that after each of the ten occurrences of the word Gizzard's the 
word Gulch appears immediately afterwards. Then, if you see the word Gizzard's again 
you expect the word Gulch will follow with high probability. In general, when you notice a 
pattern in some data in the real world, you expect the pattern to continue in later samples 
of the same type of data. 

This behavior can be captured with an MDL-style prior. In particular, we can capture 
this behavior by choosing a prior that assigns high probabilities to data that can be described 
with short programs. By programs, we mean programs written in a computer language such 
as Pascal or Lisp. For example, let us take our programming language to be a Pascal-like 
pseudo-code. Now, consider estimating the probability that a coin turns up heads on the 
next toss, given that all hundred previous tosses of the coin yielded heads. That is, we want 
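to estimate

\[
p(h \mid 100\ h\text{'s})
  = \frac{p(101\ h\text{'s})}{\sum_{x \in \{h, t\}} p(100\ h\text{'s}, x)}
  = \frac{p(101\ h\text{'s})}{p(101\ h\text{'s}) + p(100\ h\text{'s}, t)} .
\]

Intuitively, this probability should be high, so we want

\[ p(101\ h\text{'s}) \gg p(100\ h\text{'s}, 1\ t) . \]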

A program that outputs 101 h's is significantly shorter than a program that outputs 100 
h's and a t. For example, for the former we might have 

    for i := 1 to 101 do
        print "h";

while for the latter we might have

    for i := 1 to 100 do
        print "h";
    print "t";

Thus, by assigning higher probabilities to data that can be generated with shorter programs, 
we get the desired behavior on this example. 
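The comparison can be made concrete with a rough Python sketch. It assumes, purely for illustration, that program length is measured as eight bits per character of the program text; this is a crude stand-in for the length of the shortest program, and the constant and function names are hypothetical.

```python
# The two Pascal-like programs from the text, each written on one line.
PROGRAM_ALL_HEADS = 'for i := 1 to 101 do print "h";'
PROGRAM_ONE_TAIL = 'for i := 1 to 100 do print "h"; print "t";'

BITS_PER_CHAR = 8  # assumption: one byte per character of program text


def length_in_bits(program):
    """Description length of a program under the assumed text encoding."""
    return BITS_PER_CHAR * len(program)


def log2_prior(program):
    """log2 of the unnormalized prior 2^{-l(program)}."""
    return -length_in_bits(program)


def log2_prior_ratio(shorter, longer):
    """log2 of prior(shorter) / prior(longer)."""
    return log2_prior(shorter) - log2_prior(longer)
```

Calling `log2_prior_ratio(PROGRAM_ALL_HEADS, PROGRAM_ONE_TAIL)` returns a positive number of bits, so the prior favors the all-heads data; as the text cautions later, the exact value should not be read as a precise prediction.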



12 Closely related to the minimum description length principle is the universal a priori probability. The
universal a priori probability can be shown to dominate all enumerable prior distributions by a constant. The
minimum description length principle can be thought of as a simplification of this elegant but incomputable
universal distribution. A thorough discussion of this topic is given by Li and Vitanyi (1993).

13 If you knew the coin was fair, then you would still expect the next toss to be heads with probability
0.5, as in the canonical grade school example. However, it is rare that you know with absolute certainty
that a coin is fair.

Similarly, for the case of predicting the next output given the preceding outputs 2, 3, 5,
7, ..., 89, we want that

\[ p(2, 3, 5, \ldots, 89, 97) \gg p(2, 3, 5, \ldots, 89, x) \]

for x ≠ 97. Again, a program that generates the former will generally be shorter than one
that generates the latter. For example, we might have

    for i := 2 to 97 do
        <code for printing out i if it is prime>

as opposed to

    for i := 2 to 89 do
        <code for printing out i if it is prime>
    print x;

For the example where the word Gulch always follows the word Gizzard's the ten times 
the word Gizzard's occurs, and where we want to estimate the probability that the word 
Gulch follows Gizzard's in its next occurrence, consider a program that encodes text using 
a bigram-like model. Assume that for efficiency, the program only explicitly codes those 
bigram probabilities that are non-zero, as only a small fraction of all bigrams occur in 
practice. To model the case where Gulch does follow Gizzard's in its next occurrence, we
only need to code a single nonzero probability of the form p(x | Gizzard's), i.e., for x = Gulch.
However, if a different word follows Gizzard's, to model this new data we need an additional
nonzero probability of the form p(x | Gizzard's). Thus, presumably the program (including
the description of its bigram model) coding this latter case will be larger than the one 
coding the former case. Thus, by assigning higher probability to data generated with 
smaller programs, we get the desired behavior of predicting Gulch with high probability. 

Now, notice that using a prior of the form p(G) = 2^{-l(G)} results in this behavior if we
just replace grammars G with programs G_p. That is, we can express the probability p(O)
of some data or observations O as

\[
p(O) = \sum_{G_p} p(O, G_p)
     = \sum_{G_p} p(G_p)\, p(O \mid G_p)
     = \sum_{\mathrm{output}(G_p) = O} p(G_p)
\]

where we have p(O|G_p) = 1 if the output of program G_p is O and p(O|G_p) = 0 otherwise.
Substituting in the prior on programs p(G_p) = 2^{-l(G_p)}, we get

\[ p(O) = \sum_{\mathrm{output}(G_p) = O} 2^{-l(G_p)} \]

which gives us that data that can be described with shorter programs have higher
probability.
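The sum over programs can be illustrated with a toy description language. The sketch below is an assumption-laden illustration, not the general construction: the two program forms (`L:` for a literal and `R:` for a repetition), the eight-bits-per-character encoding, and the function names are all hypothetical, and the resulting values are not normalized.

```python
from fractions import Fraction

BITS_PER_CHAR = 8  # assumption: programs are coded as text, one byte per character


def programs_for(output):
    """Enumerate the toy programs whose output is `output`.

    The toy language (an assumption for illustration) has two forms:
      L:<text>            prints <text>
      R:<count>:<unit>    prints <unit> repeated <count> times
    """
    programs = ["L:" + output]
    n = len(output)
    for unit_len in range(1, n):
        if n % unit_len == 0:
            unit = output[:unit_len]
            count = n // unit_len
            if unit * count == output:
                programs.append(f"R:{count}:{unit}")
    return programs


def unnormalized_probability(output):
    """Sum of 2^{-l(G_p)} over the toy programs G_p that output `output`."""
    return sum(
        Fraction(1, 2 ** (BITS_PER_CHAR * len(program)))
        for program in programs_for(output)
    )
```

In this toy language the string of 101 h's has a short repetition program, whereas 100 h's followed by a t can only be written literally, so `unnormalized_probability("h" * 101)` is far larger than `unnormalized_probability("h" * 100 + "t")`.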

While the MDL-style prior p(G_p) = 2^{-l(G_p)} yields this nice behavior, there are several
provisos. First of all, notice that this prior is not appropriate for making precise predictions.
For example, while in the above examples we make arguments about the relative magnitude
of different probabilities, it would be folly to try to nail down actual probabilities and expect
them to be accurate. Also, notice that we used data sets of non-trivial size; this is because
the inaccuracy of this type of prior is especially marked for small data sets. For example,
consider the case of predicting the next value in the sequence 2, 3, 5, 7. In this case, it is
unlikely that the shortest program that outputs this sequence is of the form

    for i := 2 to 7 do
        <code for printing out i if it is prime>

and the argument given earlier for the longer sequence of primes does not hold. Instead, a 
shorter program would be 

print "2, 3, 5, 7"; 

Hence, for this short sequence of primes it is unclear whether the MDL-style prior would 
predict 11 with high probability, even though intuitively this is the correct prediction. 

Both of these issues are related to the fact that there are many different programming
languages we could use to describe programs, and that the same program in different
languages may have very different lengths. Thus, the specific behavior of the prior depends
greatly on the language used. However, for large pieces of data the relative differences in
program length between programming languages become smaller. For example, if a
program is 10 lines in Lisp and 1,000 lines in Basic, this is a relatively large difference. However,
a 100,010-line Lisp program and a 101,000-line Basic program are nearly the same length
from a relative perspective.[14] Thus, for large pieces of data the prior will yield qualitatively
similar results independent of programming language.
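The arithmetic behind this example is easy to check. The following small Python sketch uses the line counts from the text; the function name and the choice to measure the gap relative to the shorter program are assumptions made for illustration.

```python
def relative_difference(len_a, len_b):
    """Gap between two program lengths, relative to the shorter program."""
    shorter, longer = sorted((len_a, len_b))
    return (longer - shorter) / shorter


# Line counts taken from the example in the text.
small_programs = relative_difference(10, 1_000)
large_programs = relative_difference(100_010, 101_000)
```

The absolute gap is the same 990 lines in both cases, but relative to program size it shrinks from a factor of 99 to under one percent.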

In any case, we choose an MDL-style prior in this work because of the observation that
by assigning higher probabilities to smaller programs we get a very rich behavior that seems
to model the real world fairly well. However, instead of considering a general programming
language, we tailor our description language to one that describes only probabilistic
context-free grammars. Considering a restricted language simplifies the search problem a great
deal, and context-free grammars are able to express many of the important properties 
of language. Furthermore, we observed above that a general MDL prior cannot make 
quantitatively accurate predictions. In this work, we attempt to tailor the prior so that 
meaningful quantitative predictions can be made in the language domain. 

To summarize, we treat grammar induction as a search for the grammar G with the
highest probability given the data or observations O, which is equivalent to finding the
grammar G that maximizes the objective function p(O|G)p(G), the likelihood of the training
data multiplied by the prior probability of the grammar. We take the prior p(G) to be 2^{-l(G)}
as dictated by the minimum description length principle. While this framework does not
restrict us to a particular grammar formalism, in this work we consider only probabilistic
context-free grammars, as they are a fairly simple, yet expressive, representation. We describe
our search strategy in Section 3.3. We describe the encoding scheme we use to calculate
l(G) in Section 3.5.

14 For any two Turing-machine-equivalent languages, there exists a constant c such that any program in
one language, say of length l bits, can be duplicated in the other language using at most l + c bits. The
general idea behind the proof is that you can just write an interpreter (of length c bits) for the former
language in the latter language.



3.3 Algorithm Outline 

We assume a simple greedy search strategy.[15] We maintain a single hypothesis grammar
that is initialized to a small, trivial grammar. We then try to find a modification to the 
hypothesis grammar, such as the addition of a grammar rule, that results in a grammar 
with a higher score on the objective function. When we find a superior grammar, we make 
this the new hypothesis grammar. We repeat this process until we can no longer find a 
modification that improves the current hypothesis grammar. 
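The control flow of such a greedy search can be sketched generically in Python. This is only an illustration of the hill-climbing loop: the callback names `propose_moves` and `score` are hypothetical, and in the toy usage below integers stand in for grammars and a simple quadratic stands in for the objective function.

```python
def greedy_search(initial, propose_moves, score):
    """Hill-climb from `initial`.

    Repeatedly take the first proposed modification that raises `score`;
    stop when no proposed modification improves the current hypothesis.
    """
    current = initial
    current_score = score(current)
    improved = True
    while improved:
        improved = False
        for candidate in propose_moves(current):
            candidate_score = score(candidate)
            if candidate_score > current_score:
                current, current_score = candidate, candidate_score
                improved = True
                break
    return current


# Toy usage (assumed for illustration): hypotheses are integers, a move
# adds or subtracts one, and the score peaks at 7.
best = greedy_search(0, lambda x: [x - 1, x + 1], lambda x: -(x - 7) ** 2)
```

The loop keeps a single hypothesis at all times, which is what keeps memory use low when each hypothesis is a potentially large grammar.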

For our initial grammar, we choose a grammar that can generate any string, to assure
that the grammar assigns a nonzero probability to the training data.[16] At the highest level
of the grammar, we have the rules

    S → S X    (1 − ε)
    S → X      (ε)

expressing that a sentence S is a sequence of X's. The quantities in parentheses are the
probabilities associated with the given rules; we describe ε and other rule probability
parameters in detail in Section 3.5.1.



Then, we have rules

    X → A    (p(A))

for every nonterminal symbol A ≠ S, X in the grammar. Combined with the earlier rules,
we have that a sentence is composed of a sequence of independently generated nonterminal
symbols. We maintain this property throughout the search process; that is, for every
symbol A that we add to the grammar, we also add a rule X → A. This assures that the
sentential symbol can expand to every symbol; otherwise, adding a symbol will not affect
the probabilities that a grammar assigns to strings.

To complete the initial grammar, we have rules

    A_a → a    (1)

for every terminal symbol or word a. That is, we have a nonterminal symbol expanding
exclusively to each terminal symbol. With the above rules, the sentential symbol can expand
to every possible sequence of words. (For every symbol A_a, there will be an accompanying
rule X → A_a.) The initial grammar is summarized in Table 3.1.

15 While searches that maintain a population of hypotheses can yield better performance, it is unclear
how to efficiently maintain multiple hypotheses in this domain because each hypothesis is a grammar that
can potentially be very large. However, stochastic searches such as simulated annealing could be practical,
though we have not tested them.

16 Otherwise, the objective function will be zero, and unless there is a single move that would cause the
objective function to be nonzero, the gradient will also be zero, thus making it difficult to search intelligently.

    S → S X      (1 − ε)
    S → X        (ε)
    X → A        (p(A))     ∀ A ∈ N − {S, X}
    A_a → a      (1)        ∀ a ∈ T

    N = the set of all nonterminal symbols
    T = the set of all terminal symbols

    Probabilities for each rule are in parentheses.

Table 3.1: Initial hypothesis grammar
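To make the shape of this initial grammar concrete, here is a minimal Python sketch that builds the rules of Table 3.1 from a vocabulary. The representation of rules as (left-hand side, right-hand side, probability) triples, the default value of epsilon, and the uniform setting of p(A) are all assumptions made for illustration; the actual setting of p(A) is described in Section 3.5.1.

```python
def initial_grammar(vocabulary, epsilon=1e-6):
    """Return the rules of the initial grammar as (lhs, rhs, probability) triples.

    p(A) is set uniformly here as a placeholder assumption.
    """
    words = sorted(set(vocabulary))
    rules = [
        ("S", ("S", "X"), 1.0 - epsilon),  # S -> S X
        ("S", ("X",), epsilon),            # S -> X
    ]
    uniform = 1.0 / len(words)
    for word in words:
        nonterminal = "A_" + word
        rules.append(("X", (nonterminal,), uniform))  # X -> A_a
        rules.append((nonterminal, (word,), 1.0))     # A_a -> a
    return rules


rules = initial_grammar(["Bob", "talks", "slowly"])
```

Each word contributes exactly two rules, so the grammar has 2 + 2|T| rules and can derive any sequence of vocabulary words.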

We use the term move set to describe the set of modifications we consider to the current 
hypothesis grammar to hopefully produce a superior grammar. Our move set includes the 
following moves: 

Move 1: Create a rule of the form A → B C (concatenation)
Move 2: Create a rule of the form A → B | C (classing)

For any context-free grammar, it is possible to express a weakly equivalent grammar using
only rules of these forms. As mentioned before, with each new symbol A we also create a
rule X → A. We describe the move set in more detail in Section 3.5.2.[17]



3.3.1 Evaluating the Objective Function 

Consider the task of calculating the objective function p(O|G)p(G) for some grammar G.
Calculating p(G) = 2^{-l(G)} turns out to be inexpensive; however, calculating p(O|G) requires
evaluating the probability p(o_i|G) for each sentence o_i in the training data, which entails
parsing each sentence in the training data. We cannot afford to parse the training data for
each grammar considered; indeed, to ever be practical for large data sets, it seems likely
that we can only afford to parse the data once.

To achieve this goal, we employ several approximations. First, notice that we do not 
ever need to calculate the actual value of the objective function; we need only to be able to 
distinguish when a move applied to the current hypothesis grammar produces a grammar 
that has a higher score on the objective function. That is, we need only to be able to 
calculate the difference in the objective function resulting from a move. This can be done 
efficiently if we can quickly approximate how the probability of the training data changes 
when a move is applied. 

17 In this chapter, we will use the symbols A, B, ... and symbols of the form A_a, B_a, ... to denote general
nonterminal symbols, i.e., nonterminal symbols other than S and X.

Figure 3.3: Initial Viterbi parse

Figure 3.4: Predicted Viterbi parse
    [S [S [X [A_Bob Bob]]] [X [B [A_talks talks] [A_slowly slowly]]]]
    [S [S [X [A_Mary Mary]]] [X [B [A_talks talks] [A_slowly slowly]]]]

To make this possible, we approximate the probability of the training data p(O|G) by
the probability of the single most probable parse, or Viterbi parse, of the training data.
Furthermore, instead of recalculating the Viterbi parse of the training data from scratch
when a move is applied, we use heuristics to predict how a move will change the Viterbi
parse. For example, consider the case where the training data consists of the two sentences

O = {Bob talks slowly, Mary talks slowly} 



In Figure 3.3, we display the Viterbi parse of this data under the initial hypothesis grammar
of our algorithm.

Now, let us consider the move of adding the rule

    B → A_talks A_slowly

to the initial grammar (as well as the concomitant rule X → B). A reasonable heuristic
for predicting how the Viterbi parse will change is to replace adjacent X's that expand to
A_talks and A_slowly respectively with a single X that expands to B, as displayed in Figure 3.4.
This is the actual heuristic we use for moves of the form A → B C, and we have analogous
heuristics for each move in our move set. By predicting the differences in the Viterbi parse 
resulting from a move, we can quickly estimate the change in the probability of the training 
data. 
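The concatenation heuristic just described can be sketched in a few lines of Python. The sketch assumes a simplified representation in which the Viterbi parse of a sentence is the flat list of symbols that the successive X's expand to; the function name and this representation are illustrative assumptions rather than the data structures used in this work.

```python
def apply_concatenation(parse, left, right, new_symbol):
    """Replace each adjacent (left, right) pair in a flat parse with new_symbol."""
    result = []
    i = 0
    while i < len(parse):
        if i + 1 < len(parse) and parse[i] == left and parse[i + 1] == right:
            result.append(new_symbol)  # one X expanding to the new symbol
            i += 2
        else:
            result.append(parse[i])
            i += 1
    return result


# Initial parses of the two training sentences under the assumed representation.
initial_parses = [
    ["A_Bob", "A_talks", "A_slowly"],
    ["A_Mary", "A_talks", "A_slowly"],
]
predicted_parses = [
    apply_concatenation(parse, "A_talks", "A_slowly", "B")
    for parse in initial_parses
]
```

Because only the affected positions change, the update touches a handful of symbols rather than requiring the sentences to be parsed again.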

Notice that our predicted Viterbi parse can stray a great deal from the actual Viterbi 
parse, as errors can accumulate as move after move is applied. To minimize these effects, 
we process the training data incrementally. Using our initial hypothesis grammar, we parse 
the first sentence of the training data and search for the optimal grammar over just that one 
sentence using the described search framework. We use the resulting grammar to parse the 
second sentence, and then search for the optimal grammar over the first two sentences using 
the last grammar as the starting point. We repeat this process, parsing the next sentence 
using the best grammar found on the previous sentences and then searching for the best 
grammar taking into account this new sentence, until the entire training corpus is covered. 

Delaying the parsing of a sentence until all of the previous sentences are processed should 
yield more accurate Viterbi parses during the search process than if we simply parse the 
whole corpus with the initial hypothesis grammar. In addition, we still achieve the goal of 
parsing each sentence but once. 

3.3.2 Parameter Training 

In this section, we describe how the parameters of our grammar, the probabilities associated 
with each grammar rule, are set. Ideally, in evaluating the objective function for a particular 
grammar we should use its optimal parameter settings given the training data, as this is the 
full score that the given grammar can achieve. However, searching for optimal parameter 
values is extremely expensive computationally. Instead, we grossly approximate the optimal 
values by deterministically setting parameters based on the Viterbi parse of the training 
data parsed so far. We rely on the post-pass, described later, to refine parameter values. 



Referring to the rules in Table 3.1, the parameter ε is set to an arbitrary small constant.
Roughly speaking, the values of the parameters p(A) are set to the frequency of the X → A
reduction in the Viterbi parse of the data seen so far, and the remaining symbols are set to
expand uniformly among their possible expansions. This issue is discussed in more detail
in Section 3.5.1.
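As a rough illustration of this deterministic setting of p(A), the Python sketch below counts X → A reductions in a set of parses and normalizes the counts. It assumes each parse is given as the flat list of symbols that the X's expand to and uses plain relative frequencies; both are simplifying assumptions, and the function name is hypothetical.

```python
from collections import Counter


def estimate_top_level_probabilities(parses):
    """Set p(A) to the relative frequency of the X -> A reduction in the parses."""
    counts = Counter(symbol for parse in parses for symbol in parse)
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}


# Flat parses of "Bob talks slowly" and "Mary talks slowly" after adding
# the rule B -> A_talks A_slowly (assumed representation).
probabilities = estimate_top_level_probabilities(
    [["A_Bob", "B"], ["A_Mary", "B"]]
)
```

No iterative re-estimation is involved: the values follow directly from counts in the current parse, which is why this step is cheap enough to run after every move.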



3.3.3 Constraining Moves 

Consider the move of creating a rule of the form A → B C. This corresponds to k^3 different
specific rules that might be created, where k is the current number of nonterminal symbols
in the grammar. As it is too computationally expensive to consider each of these rules at
every point in the search, we use heuristics to constrain which moves are appraised.

For the left-hand side of a rule, we always create a new symbol. This heuristic selects 
the optimal choice the vast majority of the time; however, under this constraint the moves 
described earlier in this section cannot yield arbitrary context-free languages. A symbol 
can only be defined in terms of symbols created earlier in the search process, so recursion 
cannot be introduced into the grammar. To partially address this, we add the move 

Move 3: Create a rule of the form A → A B | B

This creates a symbol that expands to an arbitrary number of B's. With this iteration
move, we can construct grammars that generate arbitrary regular languages. In Section
3.5.5, we discuss moves that extend our coverage to arbitrary context-free languages.

To constrain the symbols we consider on the right-hand side of a new rule, we use what
we call triggers.[18] A trigger is a configuration in the Viterbi parse of a sentence that is
indicative that a particular move might lead to a better grammar. For example, in Figure
3.3 the fact that the symbols A_talks and A_slowly occur adjacently is indicative that it could
be profitable to create a rule B → A_talks A_slowly. We have developed a set of triggers for
each move in our move set, and only consider a specific move if it is triggered somewhere
in the current Viterbi parse.
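A minimal Python sketch of the adjacency trigger for the concatenation move is given below. It assumes that each Viterbi parse is available as the flat list of symbols that the X's expand to, and it treats every adjacent pair as a candidate; the actual trigger definitions are given with the move set, so this is only an illustration with a hypothetical function name.

```python
from collections import Counter


def concatenation_triggers(parses):
    """Count adjacent symbol pairs; each pair is a candidate for A -> B C."""
    pairs = Counter()
    for parse in parses:
        for left, right in zip(parse, parse[1:]):
            pairs[(left, right)] += 1
    return pairs


triggers = concatenation_triggers(
    [
        ["A_Bob", "A_talks", "A_slowly"],
        ["A_Mary", "A_talks", "A_slowly"],
    ]
)
```

Only pairs that actually occur in the parses are ever proposed, so the number of candidate right-hand sides is bounded by the size of the data rather than by the k^2 possible symbol pairs.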

3.3.4 Post-Pass 

A conspicuous shortcoming in our search framework is that the grammars in our search 
space are fairly unexpressive. Firstly, recall that our grammars model a sentence as a 
sequence of independently generated symbols; however, in language there is a large depen- 
dence between adjacent constituents. Furthermore, the only free parameters in our search 
are the parameters p(A); all symbols besides S and X are fixed to expand uniformly. These 
choices were necessary to make the search tractable. 

To address these issues, we use an Inside-Outside algorithm post-pass. Our methodology
is derived from that described by Lari and Young (1990). We create n new nonterminal
symbols {X_1, ..., X_n}, and create all rules of the form:

    X_i → X_j X_k      i, j, k ∈ {1, ..., n}
    X_i → A            i ∈ {1, ..., n}, A ∈ N_old − {S, X}

N_old denotes the set of nonterminal symbols acquired in the initial grammar induction phase,
and X_1 is taken to be the new sentential symbol. These new rules replace the first three
rules listed in Table 3.1. The parameters of these rules are initialized randomly. Using this
grammar as the starting point, we run the Inside-Outside algorithm on the training data
until convergence.

18 This is not to be confused with the use of the term triggers in dynamic language modeling.

In other words, instead of using the naive S → S X | X rule to attach symbols together
in parsing data, we now use the X_i rules and depend on the Inside-Outside algorithm
to train these randomly initialized rules intelligently. This post-pass allows us to express 
dependencies between adjacent symbols. In addition, it allows us to train parameters that 
were fixed during the initial grammar induction phase. 
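The rule set used as the starting point of the post-pass can be generated mechanically. The following Python sketch does so under stated assumptions: rules are stored as a dictionary from each X_i to a distribution over right-hand sides, probabilities are drawn uniformly at random and then normalized, and the function name and seed parameter are hypothetical. Running the Inside-Outside algorithm itself is outside the scope of the sketch.

```python
import itertools
import random


def post_pass_rules(n, old_nonterminals, seed=0):
    """Build X_i -> X_j X_k and X_i -> A rules with random, normalized probabilities."""
    rng = random.Random(seed)
    xs = [f"X_{i}" for i in range(1, n + 1)]
    rules = {}
    for lhs in xs:
        # All binary expansions over the new symbols, then all unary
        # expansions to symbols acquired during grammar induction.
        expansions = list(itertools.product(xs, repeat=2))
        expansions += [(symbol,) for symbol in old_nonterminals]
        weights = [rng.random() for _ in expansions]
        total = sum(weights)
        rules[lhs] = {rhs: w / total for rhs, w in zip(expansions, weights)}
    return rules


rules = post_pass_rules(3, ["A_Bob", "B"])
```

The number of rules grows as n^3 + n|N_old|, which is why the number n of new symbols has to be kept small.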

3.3.5 Algorithm Summary 



We summarize the algorithm, excluding the post-pass, in Figure 3.5. In Section 3.4, we
relate our algorithm to previous work on grammar induction. In Section 3.5, we flesh out
the details of the algorithm, including the move set, the encoding of the grammar used to 
calculate the objective function, and the parsing algorithm used. In addition, we describe 
extensions to the basic algorithm that we have implemented. 

3.4 Previous Work 

3.4.1 Bayesian Grammar Induction 

Work by Solomonoff (1960; 1964) is the first to lay out the general Bayesian grammar
induction framework that we use. Solomonoff points out the relation between encodings and
prior probabilities, and using this relation describes objective functions for several induction
problems including probabilistic context-free grammar induction. Solomonoff evaluates the
objective function manually for a few example grammars to demonstrate the viability of the
given objective function, but does not specify an algorithm for automatically searching for
good grammars. Solomonoff's work can be seen as a precursor to the minimum description
length principle, and would in fact lead to the closely related universal a priori probability
(Solomonoff, 1960; Solomonoff, 1964; Li and Vitanyi, 1993).



Cook et al. (1976) present a probabilistic context-free grammar induction algorithm
that employs a similar framework. While not formally Bayesian, their objective function 
strongly resembles a Bayesian objective function. In particular, their objective function is 
a weighted sum of the complexity of a grammar and the discrepancy between the grammar 
and the training data. The first term is analogous to a prior on grammars, and the second is 
analogous to the probability of the data given the grammar. However, the actual measures 
used for complexity and discrepancy are rather dissimilar from those used in this work. 

Their initial hypothesis grammar consists of the sentential symbol expanding to every
sentence in the training data and only those strings. This contrasts with the simple
overgenerating grammar we use for our initial grammar. For Cook et al., initially the length of the
grammar is large and the length of the data given the grammar is small, while the converse
is true for our approach. We hypothesize that neither approach is inherently superior; in
the former approach one just chooses moves that tend to compact the grammar, while in
the latter approach one chooses moves that tend to compact the data.

    ; G holds current hypothesis grammar
    ; V holds best parse for each sentence seen so far
    G := initial hypothesis grammar
    V := ε

    ; training data is composed of sentences (o_1, ..., o_n)
    for i := 1 to n do
    begin
        ; calculate best parse V_i of current sentence and append
        ; to V, the list of best parses
        V_i := best parse of sentence o_i under grammar G
        V := append(V, V_i)

        ; T holds the list of triggers yet to be checked
        T := set of triggers in V_i
        while T ≠ ε do
        begin
            ; pick the first trigger t in T and remove from T
            t := first(T)
            T := remove(T, t)

            ; check if associated move is profitable, if so, apply
            m := move associated with trigger t
            G_m := grammar yielded if move m is applied to the grammar G,
                   including parameter re-estimation
            V_m := best parse yielded if move m is applied to G
            Δ := change in objective function if G becomes G_m and V becomes V_m
            if Δ > 0 then
            begin
                G := G_m
                V := V_m
                T := append(T, new triggers in V)
            end
        end
    end

Figure 3.5: Outline of search algorithm

The move set used by Cook et al. includes: substitution, which is analogous to our 
concatenation move except that instead of a new symbol being created on the left-hand 
side existing symbols can be used as well; disjunction, which is analogous to our classing 
rule; a move for removing inaccessible productions; and a move for merging two symbols 
into one. They describe a greedy heuristic search strategy, and present results on small 
data sets, the largest being tens of sentences. Their work is geared toward finding elegant 
grammars as opposed to finding good language models, and thus they do not present any 
language modeling results. 



Stolcke and Omohundro (1994) also present a similar algorithm. Again, there are several
significant differences from our approach. They adhere to the Bayesian framework as we do;
however, their prior on grammars is divided into two different terms: a prior on grammar
rules, and a prior on the probabilities associated with grammar rules. For the former, they
use an MDL-like prior p(G) = c^{-l(G)} where c is varied during the search process. For the
latter, they use a Dirichlet prior. In our work, both grammar rules and rule probabilities are 
expressed within the MDL framework. In addition, like Cook et al., Stolcke and Omohundro 
choose an initial grammar where the sentential symbol expands to every sentence in the 
training data and only those strings. 

The most important differences between this work and ours concern the move set and 
search strategy. Stolcke and Omohundro describe only two moves: a move for merging 
two nonterminal symbols into one, and a move named chunking that is analogous to our 
concatenation move. As their search strategy, they use a beam search, which requires 
maintaining multiple hypothesis grammars. They describe how this is necessary because 
with their move set, often several moves must be made in conjunction to improve the 
objective function. We have addressed this problem in our work by using a rich move set
(see Section 3.5.2); we have complex moves that hopefully correspond to those move tuples
of Stolcke and Omohundro that often lead to improvements in the objective function. In
addition, at each point in the search they consider every possible move, as opposed to 
using the triggering heuristics we use to constrain the moves considered. Because of these 
differences, we assume our algorithm is significantly more efficient than theirs. They do not 
present any results on data sets approaching the sizes that we used, and like Cook et al., 
they do not present any language modeling results. 

3.4.2 Other Approaches 

The most widely used tool in probabilistic grammar induction is the Inside-Outside
algorithm (Baker, 1979), a special case of the Expectation-Maximization algorithm (Dempster
et al., 1977). The Inside-Outside algorithm takes a probabilistic context-free grammar and
adjusts its probabilities iteratively to attempt to maximize the probability the grammar
assigns to some training data. It is a hill-climbing search; it generally improves the probability
of the training data in each iteration and is guaranteed not to lower the probability.



Lari and Young (1990; 1991) have devised a grammar induction algorithm centered on
the Inside-Outside algorithm. In this approach, the initial grammar consists of a very large
set of rules over some fixed number of nonterminal symbols. Probabilities are initialized
randomly, and the Inside-Outside algorithm is used to prune away extraneous rules by
setting their probabilities to near zero; the intention is that this process reveals the correct
grammar. In Section 3.6, we give a more detailed description. Lari and Young present
results on various training corpora, with some success. In our experiments, we replicate the 
Lari and Young algorithm for comparison purposes. 

Pereira and Schabes (1992) extend the Lari and Young work by training on corpora
that have been manually parsed. They use the manual annotation to constrain the Inside- 
Outside training. However, their goal was parsing as opposed to language modeling, so no 
language modeling results are reported. 

Carroll (1995) describes a heuristic algorithm for grammar induction that employs the
Inside-Outside algorithm extensively. Carroll restricts the grammars he considers to a 
type of probabilistic dependency grammars, which are a subset of probabilistic context-free 
grammars. In particular, he only considers grammars where there is one nonterminal symbol 
A associated with each terminal symbol A and no other nonterminal symbols. Furthermore, 
all rules expanding a nonterminal symbol A must have the corresponding terminal symbol 
A somewhere on the right-hand side. 

Carroll begins with a seed grammar that is manually constructed. The training corpus 
is parsed a sentence at a time, and he has heuristics for adding new grammar rules if a 
sentence is unparsable with the current grammar. The Inside-Outside algorithm is used 
during this process as well as afterwards to refine rule probabilities. In addition, there are 
manually-constructed constraints on the new rules that can be created. 

Carroll reports results for building language models for part-of-speech sequences cor- 
responding to sentences. Training on 300,000 words/part-of-speech tags from the Brown 
Corpus, he reports slightly better perplexities on test data than trigram part-of-speech tag 
models on the roughly 99% of the sentences the grammar can parse. In addition, by linearly
interpolating the grammatical model and the trigram model, he achieves a better perplexity
on the entire test set than the trigram model alone. 



McCandless and Glass (1993) present a heuristic grammar induction algorithm that
does not use the Inside-Outside algorithm. They begin with a grammar consisting of the
sentential symbol expanding to every sentence, and they have a single move for improving 
the grammar, a move that combines classing and concatenation. However, they do not take 
a Bayesian approach in determining which moves to take. Instead, classing is based on how 
similar the bigram distributions of two symbols are, and concatenation is based on how 
frequently symbols occur adjacently. For evaluation, they build an n-gram symbol model 
using the symbols induced. That is, instead of predicting the next word based on the last 
n - 1 words, they predict the next word based on the last n - 1 symbols. With these n-gram
symbol models, they achieve slightly better perplexity on test data than the corresponding 
n-gram word models. 
3.5 Algorithm Details



3.5.1 Grammar Specification 

In this section, we describe in detail the forms of the grammar rules we consider and we 
discuss how rules are assigned probabilities. 



Recall the structure of the grammar we use as described in Section 3.3. We have rules
expressing that a sentence S is a sequence of X's:

    S -> SX    (1 - ε)
    S -> X     (ε)

where the quantity in parentheses is the probability associated with the given rule. We
take ε to be an arbitrarily small constant so that it can be safely ignored in the objective
function calculation (see footnote 19).

Then, we have a rule of the form 

    X -> A    (p(A))

for each nonterminal symbol A ≠ S, X. To calculate p(A), we use the frequency with which
the associated rule has been used in past parses. We keep track of c(X -> A), the number
of times the rule X -> A is used in the current best parse V (see Figure 3.5), and we just
normalize this value to yield p(A) as follows:

$$p(A) = \frac{c(X \rightarrow A)}{\sum_{A} c(X \rightarrow A)}$$
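
As an illustration, the normalization of the counts c(X -> A) can be sketched in Python as follows. The function name `top_level_probs` and the representation of the best parse as a list of symbol sequences (one sequence per sentence, listing the symbols A that each X expands to) are assumptions made for this sketch, not part of the original system.

```python
from collections import Counter


def top_level_probs(best_parse):
    """Estimate p(A) by normalizing c(X -> A) over the current best parse.

    best_parse is assumed to be a list of sentences, each given as the list
    of symbols A that the top-level X's expand to (a hypothetical layout).
    """
    counts = Counter(symbol for sentence in best_parse for symbol in sentence)
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}


# Two sentences, each parsed as three top-level symbols.
parse = [
    ["A_Bob", "A_talks", "A_slowly"],
    ["A_Mary", "A_talks", "A_slowly"],
]
probs = top_level_probs(parse)
```

Here `A_talks` and `A_slowly` each account for two of the six top-level expansions, while `A_Bob` and `A_Mary` account for one each.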

Finally, we have rules that define the expansions of the nonterminal symbols besides S 
and X. We restrict such nonterminal symbols to expand in exactly one of four ways: 

• expansion to a terminal symbol: A -> a

• concatenation of two nonterminal symbols: A -> BC

• classing of two nonterminal symbols: A -> B | C

• repetition of a nonterminal symbol: A -> AB | B

For instance, we do not allow a symbol A -> a | BC that expands to both a terminal symbol
and a concatenation of two nonterminal symbols.

While composing rules of the first three forms is sufficient to describe any context-free 
language, the move set we use cannot introduce recursion into the grammar. We add the 
fourth form to model a simple but common instance of recursion. Even with this extra
form, we can still only model regular languages; in Section 3.5.5 we describe extensions that
relax this restriction.

19 In a parse tree, the latter rule can be applied at most once while the former rule can be applied many
times. Thus, the probability contributed to a parse by these rules is of the form (1 - ε)^k ε. For small ε, this
expression is very nearly equal to just ε, a constant. As we are only concerned with changes in the objective
function, constant expressions can be ignored.

Figure 3.6: Example class hierarchy

For rules of the first two forms, the probability associated with the rule is 1 - p_s, where
p_s is the probability associated with smoothing rules, which will be discussed in the next
section.

For classing rules, we choose probabilities to form a uniform distribution over all symbols 
the class can expand to, as defined as follows. First, notice that while a given classing rule 
only classes two symbols, by composing several of these rules you can effectively class an 
arbitrary number of symbols. For example, consider the rules 

    A_1 -> A_2 | A_3
    A_2 -> A_4 | A_5
    A_4 -> A_6 | A_7

which we can express using a tree as in Figure 3.6. We can see that A_1 classes together
the symbols A_3, A_5, A_6, and A_7 at its leaves. Then, instead of assigning 0.5 probability to
A_1 expanding to each of A_2 and A_3, we assign 0.25 probability to A_1 expanding to each
of A_3, A_5, A_6, and A_7. That is, we assign a uniform distribution on all symbols the class
recursively expands to at its leaves, not a uniform distribution on the symbols that the
class immediately expands to. (In this example, we assume that A_3, A_5, A_6, and A_7 are
not classing symbols themselves.) Then, to satisfy these leaf expansion probabilities, we
have that A_2 expands to A_4 with probability 2/3 and A_5 with probability 1/3, and A_1
expands to A_2 with probability 3/4 and A_3 with probability 1/4. Notice that this uniform
leaf probability constraint assigns consistent probabilities among different symbols. The
probabilities in the class hierarchy we set to satisfy uniform leaf probabilities for A_1 are
the same we use to satisfy the uniform leaf probabilities for A_2. In actuality, the preceding
discussion is not quite accurate as we multiply each of the probabilities by 1 - p_s as in the
non-classing rules, to allow for smoothing.
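
The uniform leaf constraint can be checked with a short Python sketch. The dictionary layout and the function names `leaf_count` and `class_probs` are assumptions made here for illustration. The sketch also assumes the class hierarchy is a tree (no leaf is reachable along two paths) and omits the 1 - p_s smoothing factor.

```python
from fractions import Fraction


def leaf_count(symbol, classes):
    """Number of non-classing symbols that `symbol` recursively expands to."""
    if symbol not in classes:
        return 1  # a non-classing symbol counts as a single leaf
    left, right = classes[symbol]
    return leaf_count(left, classes) + leaf_count(right, classes)


def class_probs(symbol, classes):
    """Branch probabilities for a classing rule that make its leaves uniform."""
    left, right = classes[symbol]
    n_left = leaf_count(left, classes)
    n_right = leaf_count(right, classes)
    total = n_left + n_right
    return {left: Fraction(n_left, total), right: Fraction(n_right, total)}


# The three classing rules from the example: each class maps to its two children.
classes = {
    "A1": ("A2", "A3"),
    "A2": ("A4", "A5"),
    "A4": ("A6", "A7"),
}
p_a1 = class_probs("A1", classes)
p_a2 = class_probs("A2", classes)
p_a4 = class_probs("A4", classes)
```

Each branch probability is the fraction of the class's leaves that lie under that branch, which reproduces the values 3/4 and 1/4 for A_1 and 2/3 and 1/3 for A_2. Multiplying along any path from A_1 to a leaf gives 1/4.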

For the repetition rule, while the rule A -> AB | B is accurate in terms of the strings
expanded to, it is inaccurate in terms of the way we assign probabilities to expansions. In
particular, the probability we assign to the symbol A expanding to exactly n B's is

$$p(A \Rightarrow^{*} B^{n}) = p_{MDL}(n) = \frac{6}{\pi^{2}} \, \frac{1}{n \, [\log_{2}(n+1)]^{2}}$$

where p_MDL is the universal MDL prior over the natural numbers (Rissanen, 1989). We
choose this parameterization because it spares us from estimating the probability
of the A -> AB expansion versus the A -> B expansion, and because in some sense it is
the most conservative distribution one can take, in that it asymptotically assigns as high
probabilities to large n as possible. Again, in actuality the above probabilities should be
multiplied by 1 - p_s for smoothing.

Notice that we have minimized the number of parameters we need to estimate; this is
to simplify the search task. So far, the only parameters we have described are the p(A)
associated with rules of the form X -> A, which are estimated deterministically from the
best parse V, and p_s, the probability assigned to smoothing rules. In Section 3.5.5, we
describe extensions where we consider richer parameterizations.

3.5.2 Move Set and Triggers 

In this section, we list the moves that we use to adjust the current hypothesis grammar,
and the patterns in the current best parse that trigger the consideration of a particular
instance of a move form. These moves all take on the form of adding a rule to the grammar;
in Section 3.5.5 we consider rules of different forms. For the rule added to the grammar,
we generate a new, unique symbol A for the left-hand side of the rule and also create a
concomitant rule of the form X -> A, as described earlier. Recall that whether a
move is taken depends on whether it improves the objective function, and that in order to
estimate this effect we approximate how the best parse of previous data changes. In this
section, we also describe for each move how we approximate its effect on the best parse if
the move is applied.

For future reference, we use the notation A_α to refer to a nonterminal symbol that
expands exclusively to the string α. For example, the symbol A_Bob expands to the word
Bob and no other strings.

Concatenation Rules 

To trigger the creation of rules of the form A -> BC, we look for two adjacent instances of
the symbol X expanding to the symbols B and C respectively. (Recall that the sentential
symbol S expands to a sequence of X's.) For example, in Figure 3.7, the following rules are
triggered: A -> A_Bob A_talks, A -> A_talks A_slowly, and A -> A_Mary A_talks. To approximate how
the best parse changes with the creation of a rule A -> BC, we simply replace all adjacent
pairs of X's expanding to B and C with a single X expanding to A. In this example, if the
rule A_1 -> A_talks A_slowly is actually created, then we would estimate the best parse to be as
in Figure 3.8.
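
The concatenation trigger and the approximate parse update can be sketched in Python as follows. The function names and the representation of the best parse as lists of top-level symbols are assumptions made for this sketch. Scanning left to right when occurrences overlap is also an assumption, since the text does not specify it.

```python
from collections import Counter


def concatenation_triggers(parse):
    """Count adjacent pairs (B, C) of top-level symbols; each pair triggers A -> B C."""
    pairs = Counter()
    for sentence in parse:
        pairs.update(zip(sentence, sentence[1:]))
    return pairs


def apply_concatenation(parse, new_symbol, b, c):
    """Approximate the new best parse: replace each adjacent B C with new_symbol."""
    updated = []
    for sentence in parse:
        out, i = [], 0
        while i < len(sentence):
            if i + 1 < len(sentence) and sentence[i] == b and sentence[i + 1] == c:
                out.append(new_symbol)
                i += 2  # both symbols are consumed by the new rule
            else:
                out.append(sentence[i])
                i += 1
        updated.append(out)
    return updated


parse = [
    ["A_Bob", "A_talks", "A_slowly"],
    ["A_Mary", "A_talks", "A_slowly"],
]
triggers = concatenation_triggers(parse)
updated = apply_concatenation(parse, "A_1", "A_talks", "A_slowly")
```

On this input the pair (A_talks, A_slowly) is triggered twice, and applying the corresponding rule leaves each sentence as a subject symbol followed by A_1.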



Classing Rules 

To trigger the creation of rules of the form A -> B | C, we look for cases where by forming the
classing rule, we can more compactly express the grammar. For example, consider Figure
3.8. These parses trigger concatenation rules of the form A -> A_Bob A_1 and A -> A_Mary A_1.
Now, if we create a rule of the form A_2 -> A_Bob | A_Mary, instead of creating two different
concatenation rules to model these two sentences, we can create a single rule of the form
A -> A_2 A_1. Thus, adding a classing rule can enable us to create fewer rules to model a
given piece of data. In this instance, the cost of creating the classing rule may offset the
gain in creating one fewer concatenation rule; however, a classing rule can be profitable if
it saves rules in more than one place in the grammar.

Figure 3.7: Triggering concatenation

Figure 3.8: After concatenation/triggering classing

Figure 3.9: After classing

In particular, whenever there are two triggered rules that differ by only one symbol, we 
consider classing the symbols that they differ by. Triggered rules that do not improve the 
objective function by themselves may become profitable after classing, as a class can reduce 
the number of new rules that need to be created. Recall that the prior on grammars assigns 
lower probabilities to larger grammars, so the fewer rules created, the better. 

In the above example, we consider building the class A_2 -> A_Bob | A_Mary because we have
two triggered concatenation rules that only differ in that A_Bob is replaced with A_Mary in
the latter rule. To approximate how the best parse changes if a classing rule is created,
we apply the classing rule wherever the associated triggering rules would be applied. In
this example, if we create the classing rule the current parse would be updated to be as in
Figure 3.9.



However, notice that if there exists a rule A_3 -> A_Bob | A_John, then by classing together A_3
and A_Mary we can get the same effect as classing together A_Bob and A_Mary: we will only need
to create a single concatenation rule to describe the two sentences in the previous example.
Similarly, if A_3 belongs to a class A_4, then we can get the same effect by classing together
A_4 and A_Mary. Thus, when two triggered rules differ by a symbol, instead of considering
classing just those two symbols, we should consider classing each class that each of those
two symbols recursively belongs to. However, this is too expensive computationally. For
example, if there are ten triggered rules of the form A -> A_α A_1 for ten different symbols
A_α and each A_α belongs to ten classes on average, then there are roughly
(10 · 10 choose 2) ≈ 5000 pairs of symbols that we could consider classing.

To address this issue, we use heuristics to reduce the number of classings we consider.
In particular, for a set of triggered rules that only differ in a single position where the
symbols occurring in that position are {A_α}, we try to find a minimal set of classes that
all of the A_α recursively belong to, and only try exhaustive pairwise classing among that
minimal set. We use a greedy algorithm to search for this set; we explain this algorithm
using an example. Consider the case where we have triggered rules of the form A -> A_α A_1
for A_α ∈ {A_Bob, A_John, A_Mary, A_{the macaw}, A_{a parrot}, A_{a frog}}. Initially, we take the
minimal covering set to be just the set of all of the symbols:

    {A_Bob, A_John, A_Mary, A_{the macaw}, A_{a parrot}, A_{a frog}}

Then, we try to find classes that multiple elements belong to. Say we notice that there is
an existing symbol A_{Bob|John} that expands to A_Bob | A_John. We group these two symbols by
replacing the two symbols with the new symbol:

    {A_{Bob|John}, A_Mary, A_{the macaw}, A_{a parrot}, A_{a frog}}

Using this new set, we again try to find symbols that multiple elements belong to. Let's
say we find an existing symbol A_{the macaw|a parrot}, giving us

    {A_{Bob|John}, A_Mary, A_{the macaw|a parrot}, A_{a frog}}

and a symbol A_{the macaw|a parrot|a frog}, giving us

    {A_{Bob|John}, A_Mary, A_{the macaw|a parrot|a frog}}

Then, if we can find no more classes that multiple elements belong to, we take this to be the
minimal covering set and consider all possible pairwise classings of these three elements.

Figure 3.10: Triggering and applying the repetition move

Notice that this algorithm attempts to find the natural groupings of the elements in the 
list as expressed through existing classes, and only tries to class together these higher-level 
classes. This is a reasonable heuristic in selecting new classings to consider. 
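
A minimal Python sketch of this greedy search follows. It assumes existing classing rules are stored as a dictionary from class symbol to its two children, and that ties are broken by preferring the class that groups the fewest current elements. Both choices, and the names `members` and `minimal_covering_set`, are illustrative assumptions. The sketch also omits the profitability test on groupings.

```python
from itertools import combinations


def members(symbol, classes):
    """All symbols that `symbol` recursively expands to through classing rules."""
    found = set()
    for child in classes.get(symbol, ()):
        found.add(child)
        found |= members(child, classes)
    return found


def minimal_covering_set(symbols, classes):
    """Greedily replace groups of elements by an existing class containing them."""
    cover = set(symbols)
    while True:
        best = None
        for cls in classes:
            if cls in cover:
                continue
            grouped = cover & members(cls, classes)
            # Prefer the class that groups the fewest (but at least two) elements.
            if len(grouped) >= 2 and (best is None or len(grouped) < len(best[1])):
                best = (cls, grouped)
        if best is None:
            return cover  # no class groups multiple elements any more
        cls, grouped = best
        cover = (cover - grouped) | {cls}


# Existing classes from the example, keyed by a readable name for each class symbol.
classes = {
    "Bob|John": ("Bob", "John"),
    "the macaw|a parrot": ("the macaw", "a parrot"),
    "the macaw|a parrot|a frog": ("the macaw|a parrot", "a frog"),
}
symbols = ["Bob", "John", "Mary", "the macaw", "a parrot", "a frog"]
cover = minimal_covering_set(symbols, classes)
pairs = list(combinations(sorted(cover), 2))
```

On the example this yields the three-element covering set from the text, so only three pairwise classings are considered rather than all fifteen pairs of the original six symbols.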

To constrain what groupings are performed, we only consider those groupings that are
profitable in terms of the objective function. For instance, in the above example we only
group together the symbols A_Bob and A_John into the symbol A_{Bob|John} if creating the rule
A -> A_{Bob|John} A_1 is more profitable (or less unprofitable) in terms of the objective function
than creating the rules A -> A_Bob A_1 and A' -> A_John A_1.

Repetition Rules 

To make rules of the form A -> AB | B, we look for multiple (i.e., at least three) adjacent
instances of the symbol X expanding to the symbol B. For example, the parse on the left
of Figure 3.10 triggers a rule of the form A -> A A_ho | A_ho. To approximate how the best
parse changes with the creation of a rule A -> AB | B, we replace all chains of at least three
consecutive X's expanding to B's with a single X expanding to A. In this example, if the
repetition rule is actually created, then we would estimate the best parse to be the parse
on the right in Figure 3.10.
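
The repetition trigger and its parse update can be sketched in Python as follows. The function names, the symbol names in the example, and the list-of-symbols representation of a parsed sentence are assumptions made for this sketch.

```python
from itertools import groupby


def repetition_triggers(sentence, min_run=3):
    """Symbols B that occur in a chain of at least min_run adjacent copies."""
    return {symbol for symbol, run in groupby(sentence) if len(list(run)) >= min_run}


def apply_repetition(sentence, b, new_symbol, min_run=3):
    """Replace each chain of at least min_run consecutive B's with new_symbol."""
    out = []
    for symbol, run in groupby(sentence):
        run = list(run)
        if symbol == b and len(run) >= min_run:
            out.append(new_symbol)  # the whole chain becomes one X -> A
        else:
            out.extend(run)  # shorter chains and other symbols are kept
    return out


sentence = ["A_ho", "A_ho", "A_ho", "A_merry"]
triggered = repetition_triggers(sentence)
updated = apply_repetition(sentence, "A_ho", "A_1")
```

Chains shorter than three are left untouched, matching the threshold stated above.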
Figure 3.11: Without smoothing rules

Figure 3.12: With smoothing rules



Smoothing Rules 

In this section, we describe the smoothing rules alluded to earlier. Just as smoothing
improves the accuracy of n-gram models, smoothing can improve grammatical models by
assigning nonzero probabilities to phenomena with no counts. More importantly, smoothing
rules cause text to be parsed in a way that provides informative triggers for producing new
rules. For example, consider the case where the only concatenation rules in the grammar are

    A_1 -> A_talks A_slowly
    A_2 -> A_Bob A_1

Then, if we see the sentences Bob talks slowly and Bob talks quickly, they will be parsed
as in Figure 3.11. While the concatenation rules can be used to parse the first sentence,
they do not apply to the second sentence. However, on the surface these sentences are very
similar and it is desirable to be able to capture this similarity. Consider adding the rule
A_slowly -> A_quickly to the grammar. Then, we can parse the two sentences as in Figure 3.12,
capturing the similarity in structure of the two sentences.
Figure 3.13: ε-smoothing rules

Furthermore, we can use the parse on the right as a trigger. For example, we might
consider creating the rules

    A'_1 -> A_talks A_quickly
    A'_2 -> A_Bob A'_1

reflecting the similarity in structure between the two constructions. The rule A_slowly ->
A_quickly helps us capture the parallel nature of similar constructions in both the best parse
and the grammar.

In the grammar, we have a rule of the form A -> B for all nonterminal symbols A, B ≠
S, X, and we call these smoothing rules. They are implicitly created whenever a new
nonterminal symbol is created. We assign them very low probabilities so that they are used 
infrequently. They are only used in the most probable parse if without them few grammar 
rules can be applied to the given text, but with them many rules can be applied, as in the 
above example. This prevents smoothing rules from indicating a parallel nature between 
overly dissimilar constructions. 

In addition, we also have smoothing rules of the form A -> ε for every nonterminal
symbol A ≠ S, X. These can capture the situation where two constructions are identical
except that a word is deleted in one. We display possible parses of the sentences Bob talks
slowly and Bob talks in Figure 3.13.

We assign probability (p_s / 2) p_G(B) to smoothing rules A -> B, and probability p_s / 2 to
smoothing rules A -> ε. We take the distribution p_G(B) to be different from the probabilities p(B)
associated with rules of the form X -> B. The probability p(B) reflects how frequently
a symbol occurs in text, and it is unclear that this is an accurate reflection of how frequently
a symbol occurs in a smoothing rule. Instead, we guess that a better reflection of this
frequency is the frequency with which a symbol occurs in the grammar; as smoothing rule
occurrences trigger rule creations, these two quantities should correlate. Thus, we take

$$p_{G}(B) = \frac{c_{G}(B)}{\sum_{B} c_{G}(B)}$$

where c_G(B) is the number of times the symbol B occurs in grammar rules. The parameter
p_s is set arbitrarily; in most experiments, we used the value 0.01.

Figure 3.14: After smoothing triggering

When smoothing rules appear in the best parse V, they trigger rule creations in the
way described in the examples given earlier in this section. In particular, we try to build
the smallest set of concatenation rules such that the given text can be parsed without the
smoothing rule. Thus, for the Bob talks quickly/Bob talks slowly example we try to build
the symbols A'_1 and A'_2 defined earlier. To approximate how the best parse is affected, we
just use the heuristics given for concatenation rules; for this example, this yields the parse
in Figure 3.14. In Section 3.5.5, we describe other types of moves that we can trigger with
smoothing rules.

Notice that in creating multiple rules, we pay quite a penalty in the objective function 
from the term favoring smaller grammars. To make this move more favorable, we have 
added an encoding to our encoding scheme that describes these types of moves compactly.
This is described in Section 3.5.3.



Specialization 

In the last section, we discussed a mechanism for handling the case where a symbol is too 
specific. For example, if we have a symbol that expands only to a string Bob talks slowly 
(ignoring smoothing rules), by applying smoothing rules this symbol can expand to all 
strings of the form Bob talks α. Furthermore, the application of a smoothing rule triggers
the creation of rules that expand to these other strings. 

On the other hand, it may be possible that a symbol is too general. For example,
consider the rules

    A_1 -> A_mouse | A_rat
    A_2 -> A_the A_1
    A_3 -> A_2 A_squeaks

The symbol A_3 expands to the strings the mouse squeaks and the rat squeaks, but suppose
only the former string appears in text. We can create a specialization of the symbol A_3 by
creating the rules

    A'_2 -> A_the A_mouse
    A'_3 -> A'_2 A_squeaks
Figure 3.15: Before and after specialization

Figure 3.16: Before and after repetition specialization



The new symbol A'_3 expands only to the string the mouse squeaks. Notice the similarity
between this move and the move triggered by the smoothing rule; the only difference is
that instead of being triggered whenever a smoothing rule occurs, this move is triggered
whenever a classing rule occurs. Parses of the mouse squeaks before and after the creation
of these specialization rules are displayed in Figure 3.15. As with the smoothing rules, we
try to create the minimal number of rules so that the given text can be parsed without the
class expansion. Also like the smoothing rules, there is a special encoding in the encoding
scheme to make these moves more favorable.

It is also possible to specialize repetition rules. For example, consider the repetition rule
A -> A B_ho | B_ho expressing that the symbol A can expand to any number of B_ho's. However,
suppose that ho always occurs exactly three times. Then, it seems reasonable to create the
rules

    A'_2 -> B_ho B_ho
    A'_3 -> A'_2 B_ho

so that we have a new symbol A'_3 that expands exactly to the string ho ho ho. Parses
of ho ho ho before and after the creation of these specialization rules are displayed in
Figure 3.16. Such specializations are triggered whenever a repetition symbol occurs in V.
They involve the creation of concatenation rules that generate the repeated symbol the
appropriate number of times, as above.

Summary 

We have moves in our move set for building concatenation, classing, and repetition grammar 
rules. These operations are the building blocks for regular languages, and as mentioned in
Section 3.3.3 our algorithm can only create grammars that describe regular languages. This
is because whenever we create a new rule, we create a new symbol to be placed on its 
left-hand side. In addition, we have no moves for modifying existing rules. Thus, it is 
impossible to introduce recursion into the grammar, except for the recursion present in the 
repetition rule. 

In addition, we have moves for generalizing and specializing existing symbols. Smoothing 
rules provide the trigger for creating symbols that generalize the set of strings existing 
symbols expand to. Classing and repetition rules provide the trigger for creating symbols 
that specialize the set of strings existing symbols expand to. 

While this forms a rich set of moves for constraining grammars, there are some obvious 
shortcomings. For example, there are no moves for modifying existing rules or deleting 
rules, or for changing the set of strings a symbol expands to. Also, there are very few free 
parameters in the grammar; we may be able to do better by allowing class expansions to 
have probabilities that are trained. In Section 3.5.5, we describe extensions such as these
that we have experimented with. 

3.5.3 Encoding 

In this section, we describe the encoding scheme that we use to describe grammars. This
determines the length l(G) of a grammar G, which is used to calculate the prior p(G) =
2^{-l(G)} on grammars, which is a term in our objective function.

While one can encode grammars using simple methods such as textual description, we
argue that it is important to use compact encodings as touched on in Section 3.2.2. First
of all, we want the prior p(G) = 2^{-l(G)} on grammars that is associated with an encoding
l(G) to be an accurate prior; that is, p(G) should model grammar frequencies accurately.
Just as good language models assign high probabilities to training data, good priors should
assign high probabilities to typical or frequent grammars. This corresponds to assigning
short lengths to typical grammars.

Furthermore, the compactness of the encoding dictates how much data is needed as
evidence to create a new grammar rule. To clarify, let us view the objective function from
the MDL perspective, i.e., as l(G) + l(O|G), the length of the grammar added to the length of
the data given the grammar, recalling that l(O|G) is simply log_2 (1 / p(O|G)). Adding a grammar
rule increases the length l(G) by some amount, say δ, so in order for a new grammar rule
to improve the objective function its application must result in a decrease in the length of
the data of at least δ. The more compactly we can encode grammar rules, the smaller δ will
be, and the less a rule needs to compress the training data in order to be profitable. This
corresponds to decreasing the amount of evidence necessary to induce a grammar rule, e.g.,
decreasing the number of times the symbols B and C need to occur next to each other to
make the creation of the rule A -> BC profitable.

Before we describe the encoding proper, we first describe how we encode positive integers
with no upper limit, such as the number of symbols in the grammar. One option is to just
set an arbitrary bound and to use a fixed-length code, e.g., to code integers using 32 bits
as in a programming language. However, it is inelegant to set a bound, and this encoding
is inefficient for small integers. Instead, we use the encoding associated with the universal
MDL prior over the natural numbers p_MDL(n) (Rissanen, 1989) mentioned in Section 3.5.1,
where

$$p_{MDL}(n) = \frac{6}{\pi^{2}} \, \frac{1}{n \, [\log_{2}(n+1)]^{2}}$$

We take the length l(n) of an integer n to be

$$l(n) = \log_{2} \frac{1}{p_{MDL}(n)} = \log_{2} \left( \frac{\pi^{2}}{6} \, n \, [\log_{2}(n+1)]^{2} \right)$$

This assigns shorter lengths to smaller integers as is intuitive; in addition, it assigns
lengths that are asymptotically as short as possible to large integers.
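
The integer code lengths implied by this prior can be computed directly. The following Python sketch takes the prior as (6/π²) · 1/(n [log_2(n+1)]²), as given above; the function names `p_mdl` and `mdl_length` are assumptions made for illustration.

```python
import math


def p_mdl(n):
    """Universal MDL prior over the positive integers, in the form used above."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return (6 / math.pi ** 2) / (n * math.log2(n + 1) ** 2)


def mdl_length(n):
    """Description length of n in bits: the negative base-2 log of its prior."""
    return -math.log2(p_mdl(n))


# Lengths grow slowly with n, so small counts are cheap to encode.
lengths = {n: mdl_length(n) for n in (1, 2, 10, 1000)}
```

Under this code an integer as large as 1000 still costs far fewer than the 32 bits of the fixed-length alternative, and the integer 1 costs under one bit.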

We now describe the encoding. Recall that we are only concerned with description 
lengths, as opposed to actual descriptions; thus, we only describe lengths here. The encoding 
is as follows: 

• First, we encode the list of all terminal symbols. How this is done is not important 
assuming the size of this list remains constant. We are only concerned with changes 
in grammar size, as we are only concerned with calculating changes in the objective 
function. 

• Then, we encode the number n_s of nonterminal symbols excluding S and X using the
universal MDL prior.

• For each nonterminal symbol A ≠ S, X, we code the following:

  - We code the count c(A) (using the universal MDL prior) used to calculate p(A) =
    c(A) / Σ_A c(A), the probability associated with the rule X -> A.

  - We code the type of the symbol, e.g., whether it is a concatenation rule, a classing
    rule, or a repetition rule. In all there are eight types (some of which we have yet
    to describe), so we use three bits to code this.

  - For each rule type, we have a different way of coding the right-hand side of the
    rule, which will be described below.

The symbol on the left-hand side of each rule is given implicitly by the order in which the
symbols are listed. That is, the first symbol listed is A_1, the second A_2, and so on up to
A_{n_s}. Notice that each symbol expands using exactly one of eight possible forms; we do not
have to consider listing multiple rules for a given symbol.
Figure 3.17: Smoothing rules

We do not have to list the rules expanding S or X because they can be determined 
implicitly from the list of nonterminal symbols. Likewise, all smoothing rules can be de- 
termined implicitly. Furthermore, all probabilities associated with the grammar can be 
determined from just the form of the grammar, except for the smoothing probability p_s and
the probabilities p(A) associated with the rules X -> A. The probabilities p(A) are coded
explicitly above. We assume the probability p_s is of constant size.

Below, we describe the eight different rule types and how we code the right-hand side 
of each. 

expansion to a terminal (A -> a) Since none of these rules are created during the search
process, it is not important how we code these rules assuming their size is constant.
Recall that we are only concerned with changes in grammar length.

concatenation rule (A -> BC) We restrict concatenation rules to have exactly two symbols
on the right-hand side, so we need not code concatenation length; we need only
code the identities of the two symbols. We constrain these two symbols on the right-hand
side to be nonterminal symbols; this is not restrictive since there is a nonterminal
symbol expanding exclusively to each terminal symbol. As there are n_s nonterminal
symbols, we can use log_2 n_s bits to code the identity of each of the two symbols. In
all rule types, to code a symbol identity we use log_2 n_s bits.

classing rule (A -> B | C) Again, we need only code two symbol identities, and we code
each using log_2 n_s bits.

repetition rule (A -> AB | B) We need only code the identity of the B symbol to uniquely
identify this type of rule, and we code this using log_2 n_s bits.

derived rule This is the encoding tailored to the move triggered by the application of a
smoothing rule, as described in Section 3.5.2. We can identify the set of rules that are
created in such a move with the following information: the symbol underneath which
the smoothing rule was applied, the location of the application of the smoothing rule,
and the symbol the smoothing rule expanded to. For instance, consider the example
given in Section 3.5.2; we re-display this in Figure 3.17. Originally, the following rules
exist:

    A_1 -> A_talks A_slowly
    A_2 -> A_Bob A_1

and the smoothing rule triggers the creation of the following rules:

    A'_1 -> A_talks A_quickly
    A'_2 -> A_Bob A'_1

We can describe the new symbol A'_2 as follows: it is just like the symbol A_2, except
where A_2 expands to A_slowly, A'_2 expands to A_quickly instead. The definition of A'_1 can
implicitly be determined from this definition of A'_2. Thus, to encode A'_2, we need to
encode A_2, the location of A_slowly, and the symbol A_quickly. To code the two symbols,
we use log_2 n_s bits for each as before. To code the location, in Figure 3.17 we see that
there are five internal nodes in the parse tree headed by A_2 (if no smoothing rules
are applied). A smoothing rule can be applied at any of these five nodes. Thus, we
need log_2 5 bits to code the location of the application of the smoothing rule. This
size will vary with the symbol under which the smoothing rule occurs. In general, we
need a total of 2 log_2 n_s + log_2 (# locations) bits to code the set of rules triggered by
a smoothing rule. We call this type of encoding a derived rule, since it derives the
definition of one symbol from the definition of another.

deletion-derived rule This is identical to the derived rule just described, except that it
corresponds to the application of an ε-smoothing rule A -> ε instead of a regular
smoothing rule A -> B. In this case, we define a new symbol as equal to an existing
symbol except that one of its subsymbols is deleted (i.e., is replaced with ε). To code
a deletion-derived rule, we just need to code the original symbol (log_2 n_s bits) and
the location of the deletion (log_2 (# locations) bits); we do not have to code a second
symbol.

specialization rule This can be viewed as identical to the derived rule, except that it
corresponds to the application of a classing rule as opposed to a smoothing rule. We
define a new symbol as equal to an existing symbol, except that instead of replacing
a subsymbol with an arbitrary symbol as in a derived rule, we replace the
subsymbol with a symbol that the subsymbol expands to. For instance, consider the
example given in Section 3.5.2; we re-display this in Figure 3.18. We can define the
symbol A'_3 to be equal to A_3, except that the symbol A_1 is replaced with the symbol
A_mouse. This rule can be coded in the same way as a derived rule. However, we have
an additional constraint not present with derived rules: we know that the symbol
that is used to replace the original symbol at a given location is a specialization of
the original symbol. That is, the original symbol expands to the replacing symbol via
some number of classing rules. We can code the replacing symbol taking advantage
of this observation; if the original symbol at the given location expands to a total of
c different symbols via classing, then we can code the replacing symbol using log_2 c
bits. Furthermore, we also have a constraint on the location not present in derived
rules: we know that the original symbol at that location must be a classing symbol.
Thus, instead of coding the location with log_2 (# locations) bits, we can code it
with log_2 (# class locations) bits. In summary, we can code specialization rules using
log_2 n_s + log_2 (# class locations) + log_2 c bits.

Figure 3.18: Before and after specialization

repetition specialization rule This is identical to the specialization rule, except instead
of dealing with classing rules it is concerned with repetition rules. Notice that a
repetition rule $A \to AB \mid B$ can just be viewed as a classing rule of the form
$A \to B \mid BB \mid BBB \mid \cdots$. Thus, this rule can be coded in a similar manner to a regular
specialization rule. Instead of coding the location using $\log_2(\text{\# class locations})$ bits,
we code it using $\log_2(\text{\# repeat locations})$ bits. Instead of coding the replacing symbol
using $\log_2 c$ bits, we code the number of repetitions using the universal MDL prior.
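
The code lengths of the first three rule types above can be summarized in a short sketch. The function and parameter names are illustrative assumptions, not taken from the original implementation, and the repetition specialization rule is omitted because it depends on the universal MDL prior defined elsewhere in the thesis.

```python
import math


def derived_rule_bits(n_symbols: int, n_locations: int) -> float:
    """Two symbol identities plus the location where the smoothing rule applies."""
    return 2 * math.log2(n_symbols) + math.log2(n_locations)


def deletion_derived_rule_bits(n_symbols: int, n_locations: int) -> float:
    """One symbol identity plus the location of the deletion; no second symbol."""
    return math.log2(n_symbols) + math.log2(n_locations)


def specialization_rule_bits(n_symbols: int, n_class_locations: int, c: int) -> float:
    """One symbol identity, a location restricted to classing symbols, and one
    of the c symbols that the original symbol expands to via classing."""
    return math.log2(n_symbols) + math.log2(n_class_locations) + math.log2(c)


# Hypothetical grammar with 8 nonterminals and five candidate locations,
# matching the "five internal nodes" case discussed above.
print(derived_rule_bits(8, 5))
print(deletion_derived_rule_bits(8, 5))
print(specialization_rule_bits(8, 2, 4))
```

Under these assumptions a specialization rule is never more expensive than a derived rule at the same location, since the number of class locations is at most the number of locations and $c$ is at most $n_s$.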



3.5.4 Parsing 

To calculate the most probable parse of a sentence given the current hypothesis grammar,
we use a probabilistic chart parser (Younger, 1967; Jelinek et al., 1992). In chart parsing,
one fills in a chart composed of cells, where each cell represents a span in the sentence to
be parsed. If the sentence is composed of the words $w_1 \cdots w_m$, then there is a cell for each
$i$ and $j$ such that $1 \le i \le j \le m$, corresponding to the span $w_i \cdots w_j$. Each cell is filled
with the set of symbols that can expand to the associated span $w_i \cdots w_j$. For example,
if the sentence is accepted under the grammar, then the symbol $S$ will occur in the cell
corresponding to $w_1 \cdots w_m$. The cells can be filled in an efficient manner with dynamic
programming (Bellman, 1957). Performing probabilistic chart parsing just requires some
extra bookkeeping; the algorithm is essentially the same.[20]
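
As an illustration of the chart-filling procedure, the following is a minimal sketch of a most-probable-parse (Viterbi) chart parser. It assumes a grammar restricted to binary rules and lexical rules, which is a simplification: the grammars in this chapter also contain classing, repetition, and smoothing rules. The data layout (`lexical`, `binary`) and the toy grammar are hypothetical.

```python
from collections import defaultdict


def viterbi_chart_parse(words, lexical, binary, start="S"):
    """Return (probability, tree) for the most probable parse, or None.

    lexical: dict mapping a word to a list of (symbol, probability) pairs
    binary:  list of (parent, left, right, probability) rules
    """
    m = len(words)
    # chart[(i, j)] maps a symbol to (best probability, backpointer)
    # for the span words[i:j].
    chart = defaultdict(dict)
    for i, word in enumerate(words):
        cell = chart[(i, i + 1)]
        for symbol, prob in lexical.get(word, []):
            if prob > cell.get(symbol, (0.0, None))[0]:
                cell[symbol] = (prob, word)
    # Fill longer spans from shorter ones (dynamic programming).
    for width in range(2, m + 1):
        for i in range(m - width + 1):
            j = i + width
            cell = chart[(i, j)]
            for k in range(i + 1, j):
                left_cell, right_cell = chart[(i, k)], chart[(k, j)]
                for parent, left, right, prob in binary:
                    if left in left_cell and right in right_cell:
                        score = prob * left_cell[left][0] * right_cell[right][0]
                        if score > cell.get(parent, (0.0, None))[0]:
                            cell[parent] = (score, (k, left, right))
    if start not in chart[(0, m)]:
        return None

    def build(symbol, i, j):
        back = chart[(i, j)][symbol][1]
        if isinstance(back, str):  # lexical entry: backpointer is the word
            return (symbol, back)
        k, left, right = back
        return (symbol, build(left, i, k), build(right, k, j))

    return chart[(0, m)][start][0], build(start, 0, m)


# Toy grammar, invented for illustration only.
lexical = {"Bob": [("N", 1.0)], "talks": [("V", 0.5)], "slowly": [("Adv", 1.0)]}
binary = [("S", "N", "VP", 1.0), ("VP", "V", "Adv", 0.4)]
print(viterbi_chart_parse(["Bob", "talks", "slowly"], lexical, binary))
```

The only difference from a non-probabilistic chart parser is that each cell stores the best score and a backpointer per symbol rather than a bare set of symbols.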



20 This is only true when trying to calculate the most probable parse of a sentence. In some applications, 
one attempts to find the total probability of a sentence, which involves summing the probabilities of all of its 
Figure 3.19: Typical parse-tree structure

However, straightforward parsing is not efficient given that we have smoothing rules of
the form $A \to B$ and $A \to \epsilon$ for all nonterminal symbols $A, B \ne S, X$. With these rules, it
is possible for any symbol to expand to any span of a sentence; each cell in the chart will
be filled with every symbol in the grammar. Consequently, naive parsing with smoothing
rules achieves the absolute worst-case time bounds for chart parsing. This is unacceptable
in this application.

Instead, we have heuristics for restricting the application of smoothing rules. We first
parse the sentence without using any smoothing rules. This yields a parse of the form
displayed in Figure 3.19. Then, we make the assumption that applying smoothing rules to
the structure below the $A_i$ in the diagram is not profitable; smoothing rules improve the
probability of a parse the most if they enable grammar rules to apply in places where none
applied before. We take the $A_i$ to be the primitive units in the sentence, and only allow
smoothing rules to be applied immediately above these units. We re-parse the sequence
$A_1 A_2 \cdots$ given these smoothing rules, yielding a new best parse. We repeat this process
with this new best parse, until the best parse is unchanging. At some point, smoothing
rules will not affect the most probable parse.



3.5.5 Extensions 

After completing the algorithm described in the previous part of this section (Chen, 1995),
which we will refer to as the basic algorithm, we experimented with different extensions
to arrive at what we refer to as the extended algorithm. In this section, we describe the
differences between the basic and extended algorithms.

parses; in this case probabilistic chart parsing is somewhat more involved than normal chart parsing because 
of the difficulty in calculating probabilities when some types of recursions are present in the grammar. 
Concatenation



In the basic algorithm, we restrict concatenation rules $A \to A_1 A_2$ to have exactly two
symbols on the right-hand side. In the extended algorithm, we allow an arbitrary number
of symbols $A \to A_1 A_2 \cdots$. However, we do not enhance the move set by adding a move that
concatenates arbitrary $k$-tuples instead of just pairs, as the bookkeeping and computation
required for this would be very expensive. Instead, we look for opportunities to unfold
shorter concatenation rules. For example, if we have two rules

$A_1 \to A_2 A_3$
$A_3 \to A_4 A_5$

and we notice that the symbol $A_3$ occurs nowhere else in the grammar, then we can replace
this pair of rules with the single rule

$A_1 \to A_2 A_4 A_5$

We can unfold the definition of $A_3$ into the first rule to yield the longer concatenation rule.
By allowing long concatenations, we make it possible to express new concatenations more
compactly, which is advantageous with respect to the objective function.
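
The unfolding step can be sketched as follows. This is a simplified illustration that assumes the grammar is given only as a dictionary of concatenation rules; in the full algorithm a symbol may also occur in classing rules or in the current parse, which this sketch does not check. The function name and representation are assumptions.

```python
from collections import Counter


def unfold_concatenations(rules):
    """Inline every concatenation symbol that is used exactly once.

    rules: dict mapping a symbol to the tuple of symbols on its right-hand side.
    Returns a new dict; the input is not modified.
    """
    rules = dict(rules)
    changed = True
    while changed:
        changed = False
        # Count how often each symbol appears on any right-hand side.
        uses = Counter(sym for rhs in rules.values() for sym in rhs)
        for parent, rhs in list(rules.items()):
            for pos, child in enumerate(rhs):
                if child in rules and child != parent and uses[child] == 1:
                    # Splice the child's definition into the parent rule.
                    rules[parent] = rhs[:pos] + rules[child] + rhs[pos + 1:]
                    del rules[child]
                    changed = True
                    break
            if changed:
                break  # recompute usage counts before continuing
    return rules


# The example from the text: A3 occurs nowhere else, so it is unfolded.
print(unfold_concatenations({"A1": ("A2", "A3"), "A3": ("A4", "A5")}))
```

Each unfolding removes one rule, so the loop terminates after at most as many passes as there are rules.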

To handle this extension in our encoding scheme, instead of just needing to code two
symbol identities as in the basic algorithm, we first encode the number of symbols $k$ in the
concatenation using the universal MDL prior. Then, we encode the $k$ symbol identities.



Classing 

In the basic algorithm, we restrict classing rules $A \to A_1 \mid A_2$ to have exactly two symbols on
the right-hand side. In the extended algorithm, we allow an arbitrary number of symbols
$A \to A_1 \mid A_2 \mid \cdots$. To support this representation, we add a new move to the move set. In
the basic algorithm, the only classing move is to create a new rule $A \to A_1 \mid A_2$. In the
extended algorithm, we also allow the move of adding a new member to an existing class.
Both of these moves fulfill the purpose of placing symbols into a common class; the one
that is chosen in a particular situation is determined by which is preferred by the objective
function.

With this new move, it becomes possible to build grammars that express arbitrary 
context-free languages, instead of just regular languages. In the basic algorithm, it is 
impossible to create recursion (except in the repetition rule) because symbols can only be 
defined in terms of symbols created earlier in the search process. Using this new move, 
we can place into the definition of a symbol a symbol created subsequently, thus enabling 
recursion. 

Another difference in the extended algorithm is that we train the probabilities associated
with classing rules. In the basic algorithm, for a classing rule $A \to A_1 \mid A_2$ we set the
probabilities $p(A \to A_i)$ of $A$ expanding to each $A_i$ in a deterministic way depending only
on the form of the grammar, as described in Section 3.5.2. In the extended algorithm, we
train the probabilities $p(A \to A_i)$ by counting the frequency of each reduction $A \to A_i$ in
the current best parse V of the training data. We take

$$p(A \to A_i) = \frac{c_A(A_i)}{\sum_j c_A(A_j)}$$

(ignoring the factor of $1 - p_s$ for handling smoothing rules), where $c_A(A_i)$ denotes the number
of times the reduction $A \to A_i$ is used in V. By training class probabilities, it should be
possible to build more accurate models of the training data.

In the encoding, instead of just needing to code two symbol identities as in the basic
algorithm, we first encode the number of symbols $k$ in the class using the universal MDL
prior. Then, we encode the $k$ symbol identities. We also need to encode the counts $c_A(A_i)$
for each $A_i$; we do this by first coding the total number of counts $c_A = \sum_{i=1}^{k} c_A(A_i)$ using
the universal MDL prior. There are $\binom{c_A + k - 1}{k - 1}$ possible ways to distribute $c_A$ counts among
$k$ elements;[21] thus, we need $\log_2 \binom{c_A + k - 1}{k - 1}$ bits to specify the values $c_A(A_i)$ given $c_A$ and $k$.
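
A small sketch may make the count-based probabilities and the cost of coding the counts concrete. The helper names and the example counts are hypothetical, and the costs of coding $k$ and $c_A$ themselves with the universal MDL prior are not included.

```python
import math


def class_probabilities(counts):
    """Relative-frequency estimate of p(A -> A_i) from reduction counts c_A(A_i)."""
    total = sum(counts.values())
    return {member: c / total for member, c in counts.items()}


def count_distribution_bits(total_count, k):
    """Bits to specify how total_count is split among k class members,
    given that total_count and k are already known to the decoder."""
    return math.log2(math.comb(total_count + k - 1, k - 1))


counts = {"A_quickly": 3, "A_slowly": 1}  # invented counts for illustration
print(class_probabilities(counts))
print(count_distribution_bits(sum(counts.values()), len(counts)))
```

For two members and four counts there are five possible splits (0+4 through 4+0), which matches the binomial coefficient $\binom{5}{1}$.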

Smoothing Rules 

There are two modifications we make with respect to smoothing rules. First, we add
insertion smoothing rules, which can be thought of as the complement of deletion or
$\epsilon$-smoothing rules. While deletion rules enable a symbol that expands to the string Bob talks
slowly to also parse the string Bob talks, as described in Section 3.5.2, insertion rules allow
the converse: they enable a symbol that expands to Bob talks to also parse the string Bob
talks slowly.

Insertion rules are of the form $A \to AB$ for all nonterminal symbols $A, B \ne S, X$. Such
a rule "inserts" a $B$ immediately after an $A$. The probability associated with this rule is
proportional to $p_G(B)$, where $p_G(B)$ is defined as in Section 3.5.2, and the probabilities of
the smoothing rules $A \to B$ and $A \to \epsilon$ are reduced accordingly. The occurrence of an
insertion rule $A \to AB$ triggers the creation of a concatenation rule concatenating $A$ and
$B$.

The second modification deals with the moves that are triggered by the occurrence of 
a smoothing rule. Because the extended algorithm contains a move for adding symbols to 
classes instead of just a move for creating classes, in the extended algorithm smoothing 
rules can trigger a move we could not consider in the basic algorithm. Consider the rules 

$A_0 \to A_{quickly} \mid A_{slowly}$
$A_1 \to A_{talks} A_0$
$A_2 \to A_{Bob} A_1$

defining a symbol $A_2$ that expands to the strings Bob talks quickly and Bob talks slowly.
Now, consider encountering a new string Bob talks well that we parse using the symbol $A_2$



21 To see this, consider a row of $c_A$ white balls. By inserting $k - 1$ black balls into this row of white balls, we
can represent a partitioning of the $c_A$ white balls into $k$ bins: the black balls separate each pair of adjacent
bins. Each of these partitionings corresponds to a different way of dividing $c_A$ counts among $k$ elements. The
total number of partitionings is equal to the number of different ways of placing the $k - 1$ black balls among
the total of $c_A + k - 1$ balls in the row, or $\binom{c_A + k - 1}{k - 1}$.
Figure 3.20: Generalization



and the smoothing rule $A_0 \to A_{well}$, as displayed in Figure 3.20. In the basic algorithm,
this triggers the possible creation of the rules

$A'_1 \to A_{talks} A_{well}$
$A'_2 \to A_{Bob} A'_1$

describing a symbol that expands just to the string Bob talks well. However, with the
ability to add new members to classes, we can instead just trigger the addition of $A_{well}$ to
the class $A_0$, yielding

$A_0 \to A_{quickly} \mid A_{slowly} \mid A_{well}$

Instead of having a symbol $A_2$ expanding to Bob talks quickly and Bob talks slowly and a
symbol $A'_2$ expanding to Bob talks well, this new move yields a single symbol $A_2$ expanding
to all three strings. This is intuitively a more appropriate action, and results in a more
compact grammar.



Grammar Compaction 

In the basic algorithm, the move set consists entirely of moves that create new rules, i.e., 
moves that expand the grammar (and hopefully compress the training data). In the ex- 
tended algorithm, we consider moves that compact the grammar (and leave the data the 
same size or perhaps even enlarge the training data). These moves help correct for the 
greedy nature of the search strategy, by providing a mechanism for reducing the number of 
grammar rules. For example, if we have the following grammar rules 



$A_1 \to A_{talks} A_{slowly}$
$A_2 \to A_{Bob} A_1$
$A'_1 \to A_{talks} A_{quickly}$
$A'_2 \to A_{Bob} A'_1$

it may be profitable to replace the above rules with the rules

$A_0 \to A_{quickly} \mid A_{slowly}$
$A''_1 \to A_{talks} A_0$
$A''_2 \to A_{Bob} A''_1$

More generally, we search for rules that differ by a single symbol, and attempt to merge 
these rules. 

To constrain what rules we consider merging, we only consider merging rules that expand 
symbols that occur in a common class. This constraint is necessary because there are many 
constructions that are on the surface very similar that have different meanings. For example, 
the strings a can and John can differ by only one word, but are unrelated in meaning. We 
restrict rule merging to symbols that are in a common class because hopefully symbols that 
have been placed in the same class are semantically related. 

In general, for any class $A \to A_1 \mid A_2 \mid \cdots$, we attempt to create new symbols merging
multiple $A_i$ whose definitions differ by only a single symbol. Whenever we make such
a symbol, we substitute it into the right-hand side of the classing rule. If this
merging yields a classing rule $A \to B$ with only a single symbol on the right-hand side,
we merge the two symbols $A$ and $B$ into a single symbol.
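
The search for merge candidates within a class can be sketched as below. This is only an illustration of the "differ by a single symbol" test restricted to members of a common class; the data structures, function names, and the class used in the example are assumptions, and the decision whether a merge is actually profitable (made by the objective function) is not modeled.

```python
from itertools import combinations


def differing_position(rhs_a, rhs_b):
    """Index at which two right-hand sides differ, if they have equal length
    and differ in exactly one place; otherwise None."""
    if len(rhs_a) != len(rhs_b):
        return None
    diffs = [i for i, (x, y) in enumerate(zip(rhs_a, rhs_b)) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def merge_candidates(class_members, definitions):
    """Pairs of members of one class whose definitions differ by one symbol.

    class_members: symbols on the right-hand side of a single classing rule
    definitions:   dict mapping a symbol to its concatenation right-hand side
    """
    candidates = []
    for a, b in combinations(class_members, 2):
        if a in definitions and b in definitions:
            pos = differing_position(definitions[a], definitions[b])
            if pos is not None:
                candidates.append((a, b, pos))
    return candidates


definitions = {
    "A1": ("A_talks", "A_slowly"),
    "A1'": ("A_talks", "A_quickly"),
    "A9": ("A_a", "A_can"),
}
# Only symbols that share a class are compared, so A9 is never considered here.
print(merge_candidates(["A1", "A1'"], definitions))
```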



Symbol Encoding 

In the basic algorithm, we encode symbol identities using $\log_2 n_s$ bits, where $n_s$ is the total
number of nonterminal symbols. However, recall that coding theory states that a
fixed-length coding such as this is an optimal coding only if all symbols are equiprobable; in
general, a symbol with frequency $p$ should be coded with $\log_2 \frac{1}{p}$ bits. Intuitively, it seems
reasonable to code rules involving frequent symbols such as $A_{the}$ with fewer bits than rules
involving rare symbols such as $A_{hippopotamus}$.

We calculate the frequency $p_G(A)$ of each symbol $A$ in the grammar as $\frac{c_G(A)}{\sum_{A'} c_G(A')}$, where
$c_G(A)$ is the number of times the symbol $A$ occurs in the grammar. We code a symbol $A$
using $\log_2 \frac{1}{p_G(A)}$ bits. However, notice that we will not know $c_G(A)$ until the end of the
grammar description, but these values are needed to code the grammar. To resolve this
problem, we explicitly code the values of $c_G(A)$ for all $A$ before we code the grammar rules.
We first code $c_G = \sum_A c_G(A)$, the total number of symbol occurrences in the grammar, using
the universal MDL prior. Then, there are $\binom{c_G + n_s - 1}{n_s - 1}$ ways to distribute $c_G$ counts among $n_s$
elements, so we just need $\log_2 \binom{c_G + n_s - 1}{n_s - 1}$ bits to code the values of $c_G(A)$ given $c_G$ and $n_s$.[22]
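
To illustrate the effect of frequency-based symbol codes, the sketch below computes $\log_2 \frac{1}{p_G(A)}$ for a toy list of symbol occurrences and compares it with the fixed-length cost $\log_2 n_s$. The occurrence list and function name are invented for the example.

```python
import math
from collections import Counter


def symbol_code_lengths(occurrences):
    """Code length in bits, log2(1 / p_G(A)), for each symbol A, where p_G(A)
    is the relative frequency of A among all symbol occurrences."""
    counts = Counter(occurrences)
    total = sum(counts.values())
    return {symbol: math.log2(total / c) for symbol, c in counts.items()}


# Hypothetical symbol occurrences collected from the grammar rules.
occurrences = ["A_the"] * 4 + ["A_dog"] * 3 + ["A_hippopotamus"]
lengths = symbol_code_lengths(occurrences)
fixed_length = math.log2(len(lengths))  # cost per symbol in the basic algorithm

for symbol, bits in lengths.items():
    print(f"{symbol}: {bits:.3f} bits (fixed-length: {fixed_length:.3f})")
```

In this toy case the frequent symbol costs fewer bits than under the fixed-length code and the rare symbol costs more, which is the trade-off the paragraph above describes.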

Maintaining the Best Parse 

As the move set grows in complexity, it becomes more difficult from an implementational 
standpoint to maintain an accurate estimate of the most probable parse V . Furthermore, 
the amount of memory needed to store V grows linearly in the training data size, so for 

22 Using an adaptive coding method, it may be possible to encode symbols even more compactly. However, 
the gain in compactness probably does not warrant the additional complexity of implementation. 
Figure 3.21: Before and after concatenation

large training sets it may be impractical to store V entirely in memory, as is necessary for 
good performance. In the extended algorithm, instead of keeping track of V explicitly, we 
just estimate the counts of all salient events in V . 

For example, to calculate whether it is profitable to create a concatenation rule $A \to BC$,
we need to know the number of times two adjacent $X$'s expand to the symbols $B$ and $C$,
respectively. Let us call this quantity $c(B, C)$. We need to keep track of counts $c(B, C)$
for all symbols $B$ and $C$. Likewise, there are similar counts that we need to keep track of
for the other types of moves. In the extended algorithm, we just record these counts as
opposed to the actual parse V.

Unfortunately, while less expensive, this system is also less accurate. Updating counts
when new sentences are parsed is straightforward; however, errors are introduced whenever
we apply a move to previous sentences. For example, consider Figure 3.21, depicting
the parse tree of Bob talks slowly before and after the creation of the concatenation rule
$A_1 \to A_{talks} A_{slowly}$. After the move, we know to set $c(A_{talks}, A_{slowly})$ to zero, as the
concatenation rule will be applied at each relevant location; however, it is more difficult to update
overlapping counts. For example, for the sentence in the example we should decrement
$c(A_{Bob}, A_{talks})$ and increment $c(A_{Bob}, A_1)$. If we maintain V explicitly, this is straightforward;
however, if we are only maintaining counts on V, the necessary information to update
these counts correctly is generally not available (unless enough counts are available to
completely reconstruct V). We use heuristics to update counts as best we can given available
information.
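
The exact update that is possible when V is stored explicitly can be sketched as follows. The sketch represents each parsed sentence as the sequence of symbols that adjacent $X$'s expand to, which is a simplifying assumption, and shows how the overlapping counts change when a concatenation rule is created. It does not reproduce the heuristics used when only counts are kept, which are not specified here.

```python
from collections import Counter


def adjacent_pair_counts(parses):
    """c(B, C): number of times adjacent top-level symbols are B followed by C."""
    counts = Counter()
    for sequence in parses:
        counts.update(zip(sequence, sequence[1:]))
    return counts


def apply_concatenation(parses, new_symbol, left, right):
    """Rewrite every adjacent (left, right) pair as new_symbol, left to right."""
    rewritten = []
    for sequence in parses:
        out, i = [], 0
        while i < len(sequence):
            if i + 1 < len(sequence) and sequence[i] == left and sequence[i + 1] == right:
                out.append(new_symbol)
                i += 2
            else:
                out.append(sequence[i])
                i += 1
        rewritten.append(out)
    return rewritten


parses = [["A_Bob", "A_talks", "A_slowly"]]
before = adjacent_pair_counts(parses)
after = adjacent_pair_counts(apply_concatenation(parses, "A_1", "A_talks", "A_slowly"))
print(before)
print(after)
```

With only `before` available, the count $c(A_{talks}, A_{slowly})$ can be zeroed, but nothing in the table says which left neighbors preceded those occurrences, which is why the overlapping counts cannot be updated exactly.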



3.6 Results 

To evaluate our algorithm, we compare its performance to that of n-gram
models and the Lari and Young algorithm.

For n-gram models, we tried $n = 1, \ldots, 10$ for each domain. To smooth the n-gram
models, we use a popular version of Jelinek-Mercer smoothing (Jelinek and Mercer, 1980;
Bahl et al., 1983), namely the version that we refer to as interp-held-out described in
Section 2.4.1.



In the Lari and Young algorithm, the initial grammar is taken to be a probabilistic
context-free grammar consisting of all Chomsky normal form rules over $n$ nonterminal
symbols $\{X_1, \ldots, X_n\}$ for some $n$, that is, all rules

$X_i \to X_j X_k$, $i, j, k \in \{1, \ldots, n\}$
$X_i \to a$, $i \in \{1, \ldots, n\}$, $a \in T$

where $T$ denotes the set of terminal symbols in the domain. All rule probabilities are
initialized randomly. From this starting point, the Inside-Outside algorithm is run until the
average entropy per word on the training data changes less than a certain amount between
iterations; in this work, we take this amount to be 0.001 bits.

For smoothing the grammar yielded by the Lari and Young algorithm, we interpolate the
expansion distribution of each symbol with a uniform distribution; that is, for a grammar
rule $A \to \alpha$ we take its smoothed probability $p_s(A \to \alpha)$ to be

$$p_s(A \to \alpha) = (1 - \lambda)\, p_b(A \to \alpha) + \lambda \frac{1}{n^3 + n|T|}$$

where $p_b(A \to \alpha)$ denotes its probability before smoothing. The value $n^3 + n|T|$ is the
number of rules expanding a symbol under the Lari and Young methodology. The parameter
$\lambda$ is trained through the Inside-Outside algorithm on held-out data. This smoothing is also
performed on the grammar yielded by the Inside-Outside post-pass of our algorithm. For
each domain, we tried $n = 3, \ldots, 10$.
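
The interpolation above can be written as a short function. This is a sketch under stated assumptions: the names are illustrative, the rule count $n^3 + n|T|$ is taken directly from the text, and the value of $\lambda$ is supplied by hand here rather than trained on held-out data.

```python
def lari_young_rule_count(n_nonterminals, n_terminals):
    """The value n^3 + n|T| used as the denominator of the uniform term."""
    return n_nonterminals ** 3 + n_nonterminals * n_terminals


def smooth_rule_probability(p_before, lam, n_rules):
    """Interpolate a rule probability with a uniform distribution over n_rules."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * p_before + lam / n_rules


n_rules = lari_young_rule_count(3, 10)  # hypothetical: 3 nonterminals, 10 terminals
print(n_rules)
print(smooth_rule_probability(0.5, 0.1, n_rules))
print(smooth_rule_probability(0.0, 0.1, n_rules))  # unseen rules get nonzero mass
```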

Because of the computational demands of our algorithm, it is currently impractical to 
apply it to large vocabulary or large training set problems. However, we present the results 
of our algorithm in three medium-sized domains. In each case, we use 4500 sentences for 
training, with 500 of these sentences held out for smoothing. We test on 500 sentences, and 
measure performance by the entropy of the test data. 
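
As a reminder of how an entropy figure in bits per word is obtained from model probabilities, the following sketch assumes that the measure is the total negative base-2 log probability of the test sentences divided by the total number of words. Whether end-of-sentence tokens are counted is not specified in this section, so the word counts are passed in explicitly; the function name and the numbers are hypothetical.

```python
import math


def entropy_bits_per_word(sentence_probs, sentence_lengths):
    """Average number of bits per word assigned by a model to test data.

    sentence_probs:   model probability of each test sentence (all > 0)
    sentence_lengths: number of words counted for each sentence
    """
    total_bits = -sum(math.log2(p) for p in sentence_probs)
    return total_bits / sum(sentence_lengths)


# Two invented sentences with probabilities 1/4 and 1/8, of 2 and 3 words.
print(entropy_bits_per_word([0.25, 0.125], [2, 3]))
```

Lower values indicate that the model assigns higher probability to the test data.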

In the first two domains, we created the training and test data artificially so as to have
an ideal grammar in hand to benchmark results. In particular, we used a probabilistic
context-free grammar to generate the data. In the first domain, we created this grammar
by hand; this simple English-like grammar is displayed in Figure 3.22. The numbers in
parentheses are the probabilities associated with each rule; rules without listed probabilities
are equiprobable. In the second domain, we derived the grammar from manually parsed
text. From a million words of parsed Wall Street Journal data from the Penn treebank, we
extracted the 20 most frequently occurring symbols, and the 10 most frequently occurring
rules expanding each of these symbols. For each symbol that occurred on the right-hand
side of a rule that was not one of the most frequent 20 symbols, we created a rule that
expanded that symbol to a unique terminal symbol. After removing unreachable rules, this
yielded a grammar of roughly 30 nonterminals, 120 terminals, and 160 rules. Parameters
were set to reflect the frequency of the corresponding rule in the parsed corpus.

For the third domain, we took English text and reduced the size of the vocabulary by 
mapping each word to its part-of-speech tag. We used tagged Wall Street Journal text from 
NP  -> ...
... -> V0 (0.2) | V1 ... (0.4) | V2 NP NP (0.2) | VP PP (0.2)
PP  -> P NP (1.0)
D   -> a | the
N   -> car | bus | boy | girl
PN  -> Joe | John | Mary
P   -> on | at | in | over
V0  -> cried | yelled | ate
V1  -> hit | slapped | hurt
V2  -> gave | presented

Figure 3.22: Sample grammar used to generate data





                      best n   entropy        entropy relative
                               (bits/word)    to n-gram
ideal grammar                  2.30           -6.5%
extended algorithm    8        2.31           -6.1%
basic algorithm       7        2.38           -3.3%
n-gram model          4        2.46
Lari and Young        9        2.60           +5.7%

Table 3.2: English-like artificial grammar



the Penn treebank, which has a tag set size of about fifty. To reduce computation time, we 
only used sentences with at most twenty words. 

In Tables 3.2-3.4, we summarize our results. The ideal grammar denotes the grammar
used to generate the training and test data. The rows basic algorithm and extended
algorithm describe the two versions of our algorithm. For each algorithm, we list the best
performance achieved over all n tried, and the best n column states which value realized this
performance. For n-gram models, n represents the order of the n-gram model; for the other
algorithms, n represents the number of nonterminal symbols used with the Inside-Outside
algorithm.[23] In Figures 3.23-3.25, we show the complete data, the performance of each
algorithm at each n for each of the three domains.
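
The "entropy relative to n-gram" column in the tables appears to be the percentage change in entropy with respect to the best n-gram model. This reading is inferred from the tabulated numbers rather than stated in the text; the sketch below checks it against entries from Tables 3.2 and 3.3. The function name is an assumption.

```python
def relative_entropy_change(entropy, ngram_entropy):
    """Percentage change of an entropy value relative to the n-gram entropy."""
    return 100.0 * (entropy - ngram_entropy) / ngram_entropy


# Entries from Table 3.2 (the n-gram entropy is 2.46 bits/word).
for name, entropy in [("ideal grammar", 2.30), ("extended algorithm", 2.31),
                      ("basic algorithm", 2.38), ("Lari and Young", 2.60)]:
    print(f"{name}: {relative_entropy_change(entropy, 2.46):+.1f}%")
```

Because the published entropies are themselves rounded to two decimals, agreement to one decimal place is the most that can be expected from such a check.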

We achieve a moderate but significant improvement in performance over n-gram models 



23 The data presented in Tables 3.2-3.4 differ slightly from the corresponding data presented in an earlier
paper (Chen, 1995). Part of this difference is due to the fact that we used a different random starting
point for the Inside-Outside post-pass of our algorithm. In addition, we used a different data set for the
part-of-speech domain.
                      best n   entropy        entropy relative
                               (bits/word)    to n-gram
ideal grammar                  4.13           -10.4%
extended algorithm    7        4.41           -4.3%
basic algorithm       9        4.41           -4.3%
n-gram model          4        4.61
Lari and Young        9        4.64           +0.7%

Table 3.3: Wall Street Journal-like artificial grammar
                      best n   entropy        entropy relative
                               (bits/word)    to n-gram
n-gram model                   3.00
basic algorithm                3.12           +4.0%
extended algorithm    7        3.13           +4.3%
Lari and Young        9        3.60           +20.0%

Table 3.4: English sentence part-of-speech sequences
Figure 3.23: Performance versus model size, English-like artificial grammar

Figure 3.24: Performance versus model size, WSJ-like artificial grammar
(x-axis: n, the model order for n-gram models or the number of nonterminals for Inside-Outside; curves: Lari/Young, n-gram, basic, extended, ideal)

Figure 3.25: Performance versus model size, part-of-speech sequences
A_1  =>* (on | over) John
A_2  =>* in (Mary | the girl)
A_3  =>* a bus
A_4  =>* (the boy | Joe | a car | Mary | a girl | a boy) at
A_5  =>* (a girl | a boy | the bus | Mary | the girl | John | the car) (hurt | slapped | hit) the car
A_6  =>* (the bus | Mary | the girl) cried
  :  (13 prepositional phrases)
A_20 =>* (a girl | a boy | the bus | Mary | the girl | John | the car) slapped (the boy | a bus)
A_21 =>* at John
A_22 =>* (a girl | a boy | the bus | Mary | the girl | John | the car) (hurt | slapped | hit) the bus

Figure 3.26: Expansions of symbols A with highest frequency p(A)

and the Lari and Young algorithm in the first two domains, while in the part-of-speech 
domain we are outperformed by n-gram models but we vastly outperform the Lari and 
Young algorithm. 

Comparing the two versions of our algorithm, we find that the extended algorithm 
performs significantly better than the basic algorithm on the English-like artificial text, 
and performs marginally better for most n on the WSJ-like artificial text. In the part-of- 
speech domain, the two algorithms perform almost identically for most n. 

In Figure 3.26, we display a sample of the grammar induced in the English-like artificial
domain. The figure displays the expansions of the symbols $A$ with the highest probabilities
$p(A)$, which is proportional to how frequently the reduction $X \to A$ is used. In some
sense, these symbols are the most frequently occurring symbols. Each row corresponds to a
different symbol, listed in decreasing frequency starting from the most frequent symbol in
the grammar. The expression displayed expresses all possible strings the symbol expands
to; it does not reflect the actual grammar rules with that symbol on the left-hand side. For
example, the most frequent symbol in the grammar expands to the strings on John and over
John, and the fifth-most frequent symbol expands to the strings a girl hurt the car, a boy
hurt the car, etc. Except for the fourth symbol, all of these symbols expand to strings that
are constituents according to the original grammar. Thus, in this domain our algorithm is
able to capture some of the structure present in the original grammar.



In Figure 3.27, we display the grammar induced by the Lari and Young algorithm with
nine nonterminal symbols in the English-like artificial domain. We display only those rules
with probability above 0.01; rule probabilities are shown in parentheses. Unlike in Figure
3.26, where we list all strings a symbol expands to, in this figure we list the most frequent
rules a symbol expands with. This grammar does a reasonable job of grouping together
similar terminal symbols. However, it does less well at recognizing higher-level structures in
the grammar. Most symbols in the induced grammar do not match well with the symbols
A_1 -> hit (0.08) | hurt (0.08) | slapped (0.09) | presented (0.06) | gave (0.06)
     | over (0.12) | on (0.12) | in (0.12) | at (0.12)
     | A_2 (0.03) | A_1 A_8 (0.04) | A_5 A_1 (0.05)
A_2 -> Mary (0.07) | Joe (0.24) | John (0.09)
     | A_2 A_5 (0.04) | A_4 A_3 (0.12) | A_4 A_6 (0.42)
A_3 -> girl (0.03) | car (0.42) | boy (0.31) | bus (0.15)
     | A_3 A_5 (0.03) | A_6 A_5 (0.05)
A_4 -> the (0.50) | a (0.50)
A_5 -> yelled (0.04) | cried (0.04) | ate (0.04)
     | A_1 A_2 (0.21) | A_1 A_7 (0.19) | A_1 A_8 (0.23) | A_1 A_9 (0.21) | A_5 A_5 (0.04)
A_6 -> girl (0.26) | car (0.23) | boy (0.24) | bus (0.26)
A_7 -> Mary (0.09) | Joe (0.06) | John (0.19)
     | A_2 A_5 (0.02) | A_4 A_3 (0.03) | A_4 A_6 (0.55) | A_7 A_5 (0.02) | A_8 A_5 (0.04)
A_8 -> Mary (0.17) | Joe (0.07) | John (0.09)
     | A_4 A_3 (0.04) | A_4 A_6 (0.58) | A_8 A_5 (0.05)
A_9 -> A_2 A_5 (0.24) | A_7 A_5 (0.27) | A_8 A_5 (0.37) | A_9 A_5 (0.08) | A_9 A_7 (0.03)

Figure 3.27: Grammar induced with Lari and Young algorithm



WSJ-like artificial      n    entropy        no. params   time (sec)
                              (bits/word)
n-gram model             3    4.61           15000        50
Lari and Young           9    4.64           2000         30000
basic alg./first pass                        800          1000
basic alg./post-pass     5    4.56           4000         5000

Table 3.5: Number of parameters and training time of each algorithm



in the original grammar. For example, the symbol $A_3$ groups together nouns with nouns
followed by a verb taking no arguments. Hence, in this domain our algorithm captures
relevant structure better than the Lari and Young algorithm.

In Table 3.5, we display a sample of the number of parameters and execution time (on
a DECstation 5000/33) associated with each algorithm. We choose n to yield approximately
equivalent performance for each algorithm. The first pass row refers to the main grammar
induction phase of our algorithm, and the post-pass row refers to the Inside-Outside
post-pass.

Notice that our algorithm produces a significantly more compact model than the n-gram 
model, while running significantly faster than the Lari and Young algorithm even though 
both algorithms employ the Inside-Outside algorithm. Part of this discrepancy is due to 
the fact that we require a smaller number of new nonterminal symbols to achieve equivalent 
performance, but we have also found that our post-pass converges more quickly even given 
the same number of nonterminal symbols. 

In Figures 3.28 and 3.29, we display the execution time and memory usage of the main
grammar induction algorithm on various amounts of training data. Both of these graphs
are nearly linear.[24]

Figure 3.28: Execution time versus training data size

Figure 3.29: Memory usage versus training data size
(x-axis: training data in tokens, 5000 to 40000)



3.7 Discussion 

Our algorithm consistently outperformed the Lari and Young algorithm in these
experiments. One way to understand this result is to note that both algorithms use the
Inside-Outside algorithm as a final step. As the Inside-Outside algorithm is a hill-climbing
algorithm, it can be interpreted as finding the nearest local minimum in the search space
to the initial grammar. In the Lari and Young framework, this initial grammar is chosen
essentially randomly. In our framework, this initial grammar is chosen intelligently using a
Bayesian search. Thus, it is not surprising that our algorithm outperforms the Lari and Young
algorithm.

From a different perspective, the main mechanism of the Lari and Young algorithm for 
selecting grammar rules is the hill-climbing search of the Inside-Outside algorithm. If we 
view this in terms of a move set, the basic move of the Lari and Young algorithm is to 
adjust rule probabilities. In our algorithm, we use a rich move set that corresponds to 
semantically meaningful notions. We have moves that can explicitly create new rules and 
nonterminal symbols unlike Lari and Young, moves that express concepts such as classing, 
specialization, and generalization. While both algorithms employ greedy searches and can 
thus be interpreted as finding nearby local minima, our results demonstrate that by using 
a richer move set this constraint is much less serious. There have been several results
demonstrating the severity of the local minima problem for the Inside-Outside algorithm
(Chen et al., 1993; de Marcken, 1995).



In terms of efficiency, the algorithms differ significantly because of the different ways 
rules are selected. In the Lari and Young algorithm, one starts with a large grammar and 
uses the Inside-Outside algorithm to prune away unwanted rules by setting their proba- 
bilities to near zero. This approach scales poorly because the grammar worked with is 
substantially larger than the desired target grammar, and using large grammars has a great 
computational expense. It is impractical to use more than tens of nonterminal symbols with 
the Lari and Young approach because the number of grammar rules is cubic in the number 
of nonterminal symbols. 

In contrast, our algorithm begins with a small grammar and adds rules incrementally. 
The working grammar only contains rules perceived as worthy and is thus not unnecessarily 
large. In our algorithm, we can build grammars with hundreds of nonterminal symbols, and 
still the execution time of the algorithm is dominated by the Inside-Outside post-pass. 

Outperforming n-gram models in the first two domains demonstrates that our algorithm
is able to take advantage of the grammatical structure present in data. For example, unlike
n-gram models, our grammar can express classing and can handle long-distance dependencies.
However, the superiority of n-gram models in the part-of-speech domain indicates that
to be competitive in modeling naturally-occurring data, it is necessary to model collocational
information accurately. We need to modify our algorithm to model n-gram information
more aggressively.

24 The jump in the graph of memory usage between 30,000 and 35,000 tokens is an artifact of the training
data used to produce the graph. At this point in the training data, there are many long sequences of a single
token. It turns out that the number of possible ways to parse a long sequence of a single token is very large.
For example, if we denote the repeated token as $a$, then any symbol that can expand to $a^k$ for some $k$ can
be used to parse any substring of length $k$ in the long sequence of $a$'s. The jump in memory is due to the
additional memory needed to store the parse chart.

3.7.1 Contribution 

This research represents a step forward in the quest for developing grammar-based language 
models for natural language. We consistently outperform the Lari and Young algorithm 
across domains and outperform n-gram language models in medium-sized artificial domains. 
The algorithm runs in near-linear time and space in terms of the amount of training data 
and the grammars induced are relatively compact, so the algorithm scales well. 

We demonstrate the viability of the Bayesian approach for grammar induction, and we 
show that the minimum description length principle is a useful paradigm for building prior 
distributions for grammars. A minimum description length approach is crucial for methods 
that do not limit the size of grammars; this approach favors smaller grammars, which is 
necessary for preventing overfitting. 

We describe an efficient framework for performing grammar induction. We present 
an algorithm that parses each sentence only once, and we use the concept of triggers to 
constrain the set of moves considered at each point. We describe a rich move set for 
manipulating grammars in intuitive ways, and this enables our search to be effective despite 
its greedy nature. 

Furthermore, this induction framework is not restricted to probabilistic context-free
grammars. For example, notice that we do not parameterize repetition rules strictly within
the PCFG paradigm. More complex grammar formalisms can be considered without changing
the computational complexity of the algorithm; we just need to enhance the move set.

Chapter 4

Aligning Sentences in Bilingual Text



In this chapter, we describe an algorithm for aligning sentences with their translations in 
a bilingual corpus (Chen, 1993). In experiments with the Hansard Canadian parliament
proceedings, our algorithm yields significantly better accuracy than previous algorithms. 
In addition, it is efficient, robust, language-independent, and parallelizable. Of the three 
structural levels at which we model language in this thesis, this represents work at the 
sentence level. 



4.1 Introduction 

A bilingual corpus is a corpus of text replicated in two languages. For example, the Hansard 
bilingual corpus contains the Canadian parliament proceedings in both English and French. 
Bilingual corpora have proven useful in many tasks, including machine translation (Brown et
al., 1990; Sadler, 1989), sense disambiguation (Brown et al., 1991a; Dagan et al., 1991; Gale
et al., 1992), and bilingual lexicography (Klavans and Tzoukermann, 1990; Warwick and
Russell, 1990).



For example, a bilingual corpus can be used to automatically construct a bilingual
dictionary. A bilingual dictionary can be expressed as a probabilistic model $p(f \mid e)$ of how
frequently a particular word $e$ in one language, say English, translates to a particular word
$f$ in another language, say French. Intuitively, one should be able to recover such a model
from a bilingual corpus. For example, if a human translator were to mark exactly which
French words correspond to which English words, we would be able to count how often a
given English word $e$ translates to each French word $f$ and then normalize to get $p(f \mid e)$,

$$p(f \mid e) = \frac{c(e, f)}{\sum_f c(e, f)}$$

where $c(e, f)$ denotes how often the word $e$ translates to $f$. However, this word alignment
information is not typically included in a bilingual corpus.

English (E)
E1: Hon. members opposite scoff at the freeze suggested by this party; to them it is laughable.

French (F)
F1: Les députés d'en face se moquent du gel que a proposé notre parti.
F2: Pour eux, c'est une mesure risible.

Figure 4.1: Two-to-one sentence alignment

Consider the case where instead of knowing which words correspond to each other, we
just know which sentences correspond to each other. Then, it is still possible to build an
approximate model of how frequently words translate to each other by using an equation
analogous to the one above:

$$p'(f \mid e) = \frac{c'(e, f)}{\sum_f c'(e, f)}$$

where $c'(e, f)$ denotes how frequently the words $e$ and $f$ occur in aligned sentences. For
words $e$ and $f$ that are mutual translations, $c'(e, f)$ will be high since every time $e$ occurs
in an English sentence $f$ will also occur in the corresponding French sentence. For words
$e$ and $f$ that are not translations, while $c'(e, f)$ may be above zero it will most probably
not be very high since it is very unlikely that unrelated words regularly co-occur in aligned
sentences.[1] Thus, with sentence alignment information we can build an approximate word
translation model $p'(f \mid e)$. Furthermore, we can use this approximate model $p'(f \mid e)$ to
bootstrap the construction of an even more accurate model of word translation. Given
sentence alignment information and a model $p'(f \mid e)$ of which words translate to each other,
we can produce a relatively accurate word alignment. Then, using the procedure described
in the last paragraph of counting aligned word pairs we can produce an improved word-to-word
translation model.
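
The co-occurrence estimate $p'(f \mid e)$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cooccurrence_model`, the whitespace tokenization, the choice to count each word type once per aligned sentence pair, and the three-sentence toy corpus are all invented here and are not taken from the thesis.

```python
from collections import Counter, defaultdict

def cooccurrence_model(aligned_pairs):
    """Estimate p'(f|e) by normalizing sentence-level co-occurrence counts c'(e, f)."""
    counts = defaultdict(Counter)  # counts[e][f] plays the role of c'(e, f)
    for english, french in aligned_pairs:
        # Assumption: each word type is counted once per aligned sentence pair.
        for e in set(english.split()):
            for f in set(french.split()):
                counts[e][f] += 1
    return {
        e: {f: c / sum(row.values()) for f, c in row.items()}
        for e, row in counts.items()
    }

# Toy sentence-aligned corpus (invented for illustration).
pairs = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]
model = cooccurrence_model(pairs)
best = max(model["house"], key=model["house"].get)  # most likely translation of "house"
```

On this toy data the best-scoring French word for house is maison, while car scores la and voiture equally, which illustrates how frequent words can distort the raw counts.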

Hence, sentence alignment is a useful step in processing a bilingual corpus. All of the 
applications mentioned earlier such as machine translation and sense disambiguation require 
bilingual corpora that are sentence-aligned. As human translators typically do not include 
sentence alignment information when creating bilingual corpora, automatic algorithms for 
sentence alignment are immensely useful. 

In this work, we describe an accurate and efficient algorithm for bilingual sentence
alignment. The task is difficult because sentences frequently do not align one-to-one. For
example, in Figure 4.1 we show an example of two-to-one alignment. In addition, there
are often deletions in one of the supposedly parallel corpora of a bilingual corpus. These
deletions can be substantial; in the version of the Canadian Hansard corpus we worked with,
there are many deletions of several thousand sentences and one deletion of over 90,000
sentences. Such anomalies are not uncommon in very large corpora. Large corpora are
often stored as a sizable set of smaller files, some of which may be accidentally deleted or
transposed.

1 This is not quite accurate. For example, the count $c'(e, \text{le})$ may be high for many English words $e$
just because the word le occurs in most French sentences. However, this effect can be corrected for, as
described in Section 4.4.1.



4.1.1 Previous Work 

The first sentence alignment algorithms successfully applied to large bilingual corpora are
those of Brown et al. (1991b) and Gale and Church (1991; 1993). Brown et al. base alignment
solely on the number of words in each sentence; the actual identities of words are
ignored. The general idea is that the closer in length two sentences are, the more likely
they align. They construct a probabilistic model of alignment and select the alignment
with the highest probability under this model. The parameters of their model include
$p(e_a f_b)$, the probability that $a$ English sentences align with $b$ French sentences, and $p(l_f \mid l_e)$,
the probability that an English sentence (or sentences) containing $l_e$ words translates to a
French sentence (or sentences) containing $l_f$ words. These parameters are estimated statistically
from bilingual text. To search for the most probable sentence alignment under
this alignment model, dynamic programming (Bellman, 1957), an efficient and exhaustive
search method, is used.

Because dynamic programming requires time quadratic in the number of sentences to be 
aligned and a bilingual corpus can be many millions of sentences, it is not practical to align a 
large corpus as a single unit. The computation required is drastically reduced if the bilingual 
corpus can be subdivided into smaller chunks. Brown et al. use anchors to perform this 
subdivision. An anchor is a piece of text of easily recognizable form likely to be present 
at the same location in both of the parallel corpora of a bilingual corpus. For example, 
Brown et al. notice that comments such as Author = Mr. Cossitt and Time = (14-15) 
are interspersed in the English text of the Hansard corpus and corresponding comments 
are present in the French text. Dynamic programming is first used to determine which 
anchors align with each other, and then dynamic programming is used again to align the 
text between anchors. 

The Gale and Church algorithm is similar to the Brown algorithm except that instead 
of basing alignment on the number of words in sentences, alignment is based on the number 
of characters in sentences. In addition, instead of using a probabilistic model and searching 
for the alignment with the highest probability, they assign lengths to different alignments 
and search for the alignment with the smallest length.[2] Dynamic programming is again
used to search for the best alignment. Large corpora are assumed to be already subdivided 
into smaller chunks. 
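
The general shape of a length-based dynamic programming aligner can be sketched as follows. This is a simplified illustration, not the model of Brown et al. or of Gale and Church: the function name `align_by_length`, the fixed per-bead penalties, and the absolute length difference used as the mismatch cost are placeholder assumptions chosen only to make the search concrete.

```python
def align_by_length(eng_lens, fr_lens):
    """Return the minimum-cost beads (e_start, e_end, f_start, f_end), 0-based, end-exclusive."""
    # Hypothetical costs: a fixed penalty per bead type plus the length mismatch.
    bead_penalty = {(1, 1): 0, (1, 0): 10, (0, 1): 10, (2, 1): 3, (1, 2): 3}
    n, m = len(eng_lens), len(fr_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    # cost[i][j] is the best cost of aligning the first i English and j French sentences.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (a, b), penalty in bead_penalty.items():
                if i + a > n or j + b > m:
                    continue
                mismatch = abs(sum(eng_lens[i:i + a]) - sum(fr_lens[j:j + b]))
                candidate = cost[i][j] + penalty + mismatch
                if candidate < cost[i + a][j + b]:
                    cost[i + a][j + b] = candidate
                    back[i + a][j + b] = (i, j)
    # Trace the best path back from the end of both texts.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((pi, i, pj, j))
        i, j = pi, pj
    return beads[::-1]

# Sentence lengths in words (invented): the second English sentence matches two French ones.
beads = align_by_length([10, 4, 12], [11, 2, 3, 12])
```

The table has one cell per pair of sentence positions, which is the source of the quadratic running time discussed in the text.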

While these algorithms have achieved remarkably good performance, there is definite
room for improvement. For example, consider the excerpt from the Hansard corpus depicted
in Figure 4.2. Length-based algorithms do not particularly favor aligning Yes with Oui over
Non or aligning Mr. McInnis with M. McInnis over M. Saunders. Thus, such algorithms
can easily misalign passages like these by an even number of sentences if there are sentences
missing in one of the languages. Constructions in one language that translate to a very
different number of words in the other language may also cause errors. In general, length-based
algorithms are not robust; they can align unrelated text because word identities are
ignored.

      English (E)             French (F)
E1    Mr. McInnis?      F1    M. McInnis?
E2    Yes.              F2    Oui.
E3    Mr. Saunders?     F3    M. Saunders?
E4    No.               F4    Non.
E5    Mr. Cossitt?      F5    M. Cossitt?
E6    Yes.              F6    Oui.

Figure 4.2: A bilingual corpus fragment

2 This can be interpreted as just working in description space instead of probability space. As described
in Section 3.2.3, these two spaces are in some sense equivalent.

Alignment algorithms that take advantage of lexical information offer a potential for 
higher accuracy. Previous work includes algorithms by Kay and Röscheisen (1993) and
Catizone et al. (1989). Kay and Röscheisen perform alignment using a relaxation paradigm.
They keep track of all possible sentence pairs that may align to each other. Initially, this 
set is very large; it is just constrained by the observation that a sentence in one language 
is probably aligned with a sentence in the other language with the same relative position 
in the corpus. For example, an English sentence halfway through the Hansard English 
corpus is probably aligned to a French sentence near the midpoint of the Hansard French 
corpus. Given this set of possible alignment pairs, word translations are induced based 
on distributional information. Using these induced word translations, the set of possible 
alignment pairs is pruned, which then yields new word translations, etc. This process is 
repeated until convergence. However, previous lexically-based algorithms have not proved 
efficient enough to be suitable for large corpora. The largest corpus aligned by Kay and 
Röscheisen contains 1,000 sentences in each language; existing bilingual corpora have many
millions of sentences. 

4.1.2 Algorithm Overview 

We describe a fast and accurate algorithm for sentence alignment that uses lexical information.
Like Brown et al., we build a sentence-based translation model and find the alignment
with the highest probability given the model. However, unlike Brown et al., the translation
model makes use of a word-to-word translation model. We bootstrap these models using a
small amount of pre-aligned text; the models then refine themselves on the fly during the
alignment process. The search strategy used is dynamic programming with thresholding.
Because of thresholding, the search is linear in the length of the corpus so that a corpus
need not be subdivided into smaller chunks.

In addition, the search strategy includes a separate mechanism for handling large deletions
in one of the corpora of a bilingual corpus. When a deletion is present, thresholding
is not effective and dynamic programming requires time quadratic in the length of the deletion
to identify its extent, which is unacceptable for large deletions. Instead, we have a
mechanism that keys off of rare words to locate the bounds of a deletion in time linear in
the length of the deletion. This deletion recovery mechanism can also be used to subdivide
a corpus into small chunks; this enables the parallelization of our algorithm.

4.2 The Alignment Model 
4.2.1 The Alignment Framework 

In this section, we present the general framework of our algorithm, that of building a 
probabilistic translation model and finding the alignment that yields the highest probability 
under this model. 

More specifically, we try to find the alignment $A$ with the highest probability given the
bilingual corpus, i.e.,

$$A = \arg\max_A p(A \mid E, F) \qquad (4.1)$$

where $E$ and $F$ denote the English corpus and French corpus, respectively. (In this paper,
we assume the two languages being aligned are English and French; however, none of the
discussion is specific to this language pair, except for the discussion of cognates in Section
4.3.6.) For now, we take an alignment $A$ to be a list of integers representing which sentence
in the French corpus is the first to align with each successive sentence in the English corpus.
For example, the alignment $A = (1, 2, 4, 5, \ldots)$ aligns the first English sentence with the first
French sentence, the second English sentence with the second and third French sentences,
the third English sentence with the fourth French sentence, etc.
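
To make this list representation concrete, the short sketch below expands an alignment into the range of French sentences covered by each English sentence. The helper name `french_spans` is hypothetical, and the sketch assumes that the last English sentence extends through the final French sentence of the corpus.

```python
def french_spans(alignment, num_french):
    """Map each English sentence to its 1-based, inclusive range of French sentences.

    alignment[i] is the index of the first French sentence aligned with English
    sentence i + 1. A range whose end precedes its start is empty.
    """
    # Assumption: one past the end, the alignment implicitly points at num_french + 1.
    bounds = list(alignment) + [num_french + 1]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(len(alignment))]

# The example alignment from the text, truncated to a five-sentence French corpus.
spans = french_spans([1, 2, 4, 5], 5)
```

Here the second English sentence receives the range (2, 3), matching the description above.
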
We manipulate equation (4.1) into a more intuitive form:

$$\begin{aligned}
A &= \arg\max_A p(A \mid E, F) \\
  &= \arg\max_A \frac{p(A, E, F)}{\sum_A p(A, E, F)} \\
  &= \arg\max_A p(A, E, F) \\
  &= \arg\max_A p(A, F \mid E)\, p(E) \\
  &= \arg\max_A p(A, F \mid E)
\end{aligned}$$

The probability $p(A, F \mid E)$ represents the probability that the English corpus $E$ translates
to the French corpus $F$ with alignment $A$. Making the assumption that successive sentences
translate independently of each other, we can re-express $p(A, F \mid E)$ as

$$p(A, F \mid E) = \prod_{i=1}^{l(E)} p(F_{A_i}^{A_{i+1}-1} \mid E_i) \qquad (4.2)$$

where $E_i$ denotes the $i$th sentence in the English corpus, $F_i^j$ denotes the $i$th through $j$th
sentences in the French corpus, $A_i$ denotes the index of the first French sentence aligning to
the $i$th English sentence in alignment $A$, and $l(E)$ denotes the number of sentences in the
English corpus.[3] We refer to the distribution $p(F_i^j \mid E)$ as a translation model, as it describes
how likely an English sentence $E$ is to translate to the French sentences $F_i^j$.

English (E)
E1: That is what the consumers are interested in and that is what the party is interested in.
E2: Hon. members opposite scoff at the freeze suggested by this party; to them it is laughable.

French (F)
F1: Voilà ce qui intéresse le consommateur et voilà ce qui intéresse notre parti.
F2: Les députés d'en face se moquent du gel que a proposé notre parti.
F3: Pour eux, c'est une mesure risible.

Figure 4.3: A bilingual corpus
With an accurate translation model $p(F_i^j \mid E)$, we can use the relation

$$A = \arg\max_A p(A, F \mid E) = \arg\max_A \prod_{i=1}^{l(E)} p(F_{A_i}^{A_{i+1}-1} \mid E_i)$$

to perform accurate sentence alignment. To give an example, consider the bilingual corpus
$(E, F)$ displayed in Figure 4.3. Now, consider the alignment $A_1 = (1, 2)$ aligning sentence
$E_1$ to sentence $F_1$ and sentence $E_2$ to sentences $F_2$ and $F_3$. We have

$$p(A_1, F \mid E) = p(F_1 \mid E_1)\, p(F_2^3 \mid E_2)$$

This value should be relatively large, since $F_1$ is a good translation of $E_1$ and $F_2^3$ is a good
translation of $E_2$. Another possible alignment $A_2 = (1, 1)$ aligns sentence $E_1$ to nothing
and sentence $E_2$ to $F_1$, $F_2$, and $F_3$. We get

$$p(A_2, F \mid E) = p(\epsilon \mid E_1)\, p(F_1^3 \mid E_2)$$

This value should be fairly low, as $\epsilon$ is a poor translation of $E_1$ and $F_1^3$ is a poor translation
of $E_2$. Hence, if our translation model $p(F_i^j \mid E)$ is accurate we will have

$$p(A_1, F \mid E) \gg p(A_2, F \mid E)$$

In general, the more sentences that are mapped to their translations in an alignment $A$, the
higher the value of $p(A, F \mid E)$.

3 In this discussion and future discussion, we only consider alignments $A$ that are consistent with $E$ and
$F$. For example, we do not consider alignments of incorrect length or alignments that refer to sentences
beyond the end of the French corpus. Obviously, for inconsistent alignments $A$ we have $p(A, F \mid E) = 0$.
Furthermore, in equation (4.2) $A_{l(E)+1}$ is implicitly taken to be $l(F) + 1$; this enforces the constraint that
the last English sentence aligns with French sentences ending in the last French sentence.

However, because our translation model is expressed in terms of a conditional distribution
$p(F_i^j \mid E)$, the above framework is not amenable to the situation where a French sentence
corresponds to multiple English sentences. Hence, we use a slightly different framework.
We view a bilingual corpus as a sequence of sentence beads (Brown et al., 1991b), where a
sentence bead corresponds to an irreducible group of sentences that align with each other.
For example, the correct alignment $A_1$ of the bilingual corpus in Figure 4.3 consists of the
sentence bead $[E_1, F_1]$ followed by the sentence bead $[E_2, F_2^3]$. Instead of expressing an
alignment $A$ as a list of sentence indices in the French corpus, we express an alignment $A$ as
a list of pairs of indices $((A_1^e, A_1^f), (A_2^e, A_2^f), \ldots)$, the indices representing which English and
French sentences begin each successive sentence bead. Under this new convention, we have
that $A_1 = ((1, 1), (2, 2))$. Unlike the previous framework, this framework is symmetric and
can handle the case where a French sentence aligns with zero or multiple English sentences.
In this framework, instead of taking

$$A = \arg\max_A p(A, F \mid E) = \arg\max_A \prod_{i=1}^{l(E)} p(F_{A_i}^{A_{i+1}-1} \mid E_i)$$

we take

$$A = \arg\max_A p(A, E, F) = \arg\max_A \; p_{\text{A-len}}(l(A)) \prod_{i=1}^{l(A)} p([E_{A_i^e}^{A_{i+1}^e - 1}, F_{A_i^f}^{A_{i+1}^f - 1}]) \qquad (4.3)$$

where $l(A)$ denotes the number of sentence beads in $A$ and $p_{\text{A-len}}(l)$ denotes the probability
that an alignment contains exactly $l$ sentence beads. The term $p_{\text{A-len}}(l(A))$ is necessary for
normalization purposes; otherwise we would not have $\sum_{A,E,F} p(A, E, F) = 1$.[4] Instead of
having a conditional translation model $p(F_i^j \mid E)$, our translation model is now expressed as
a distribution $p([E_i^j, F_k^l])$ representing the frequencies of sentence beads $[E_i^j, F_k^l]$.



4 To show this, notice that for any $l$ we have that

$$\sum_{x_1, \ldots, x_l} \prod_{i=1}^{l} p(x_i) = \sum_{x_1} p(x_1) \sum_{x_2} p(x_2) \cdots \sum_{x_l} p(x_l) = 1 \cdot 1 \cdots 1 = 1$$

Applying this relation to equation (4.3), we have that

$$\sum_{A,E,F} p(A, E, F) = \sum_{l} \; \sum_{l(A)=l,\,E,\,F} p_{\text{A-len}}(l(A)) \prod_{i=1}^{l(A)} p([E_{A_i^e}^{A_{i+1}^e - 1}, F_{A_i^f}^{A_{i+1}^f - 1}]) = \sum_{l} p_{\text{A-len}}(l) = 1$$

Without the term $p_{\text{A-len}}(l(A))$ the sum is infinite.

4.2.2 The Basic Translation Model



As we can see from equation (4.3), the key to accurate alignment in our framework is
coming up with an accurate translation model $p([E_i^j, F_k^l])$. For this translation model, we
desire the simplest model that incorporates lexical information effectively. We describe our
model in terms of a series of increasingly complex models. In this section, we consider
only sentence beads $[E, F]$ containing a single English sentence $E = e_1 \cdots e_{l(E)}$ and single
French sentence $F = f_1 \cdots f_{l(F)}$. As a starting point, consider a model that assumes that
all individual words are independent, i.e., a model where the probability of some text is the
product of the probabilities of each word in the text. More specifically, we can take

$$p'([E, F]) = p_{\text{e-len}}(l(E))\, p_{\text{f-len}}(l(F)) \prod_{i=1}^{l(E)} p_e(e_i) \prod_{j=1}^{l(F)} p_f(f_j)$$

where $p_{\text{e-len}}(l)$ is the probability that an English sentence is $l$ words long, $p_{\text{f-len}}(l)$ is the
probability that a French sentence is $l$ words long, $p_e(e_i)$ is the frequency of the word $e_i$ in
English, and $p_f(f_j)$ is the frequency of the word $f_j$ in French. The terms $p_{\text{e-len}}(l(E))$ and
$p_{\text{f-len}}(l(F))$ are necessary to make $\sum_{E,F} p'([E, F]) = 1$ (as in equation (4.3)).[5] For example,
we have that

$$p'([\text{do you speak French}, \text{parlez-vous français}]) = p_{\text{e-len}}(4)\, p_{\text{f-len}}(3)\, p_e(\text{do})\, p_e(\text{you})\, p_e(\text{speak})\, p_e(\text{French})\, p_f(\text{parlez})\, p_f(\text{vous})\, p_f(\text{français})$$

Clearly, this is a poor translation model because it takes the English sentence and French 
sentence to be independent, so it does not assign higher probabilities to sentence pairs that 
are translations. For example, it would assign about the same probabilities to the following 
two sentence beads: 

[do you speak French, parlez-vous français]
[did they eat German, parlez-vous français]
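
A small numerical sketch shows why the independent-word model cannot tell these two beads apart. The function name `independent_logprob`, the whitespace tokenization (with parlez-vous split into two words, as in the three-word count above), and every probability below are invented for illustration; English words are given equal frequencies so that the effect is exact.

```python
import math

def independent_logprob(english, french, p_e_len, p_f_len, p_e, p_f):
    """log p'([E, F]) under the model in which every word is independent."""
    e_words, f_words = english.split(), french.split()
    logp = math.log(p_e_len[len(e_words)]) + math.log(p_f_len[len(f_words)])
    logp += sum(math.log(p_e[w]) for w in e_words)
    logp += sum(math.log(p_f[w]) for w in f_words)
    return logp

# Invented toy parameters: every English word is equally frequent.
english_vocab = ["do", "you", "speak", "French", "did", "they", "eat", "German"]
p_e = {w: 1 / len(english_vocab) for w in english_vocab}
p_f = {"parlez": 0.3, "vous": 0.5, "français": 0.2}
p_e_len, p_f_len = {4: 1.0}, {3: 1.0}

french = "parlez vous français"
good = independent_logprob("do you speak French", french, p_e_len, p_f_len, p_e, p_f)
bad = independent_logprob("did they eat German", french, p_e_len, p_f_len, p_e, p_f)
```

Because the French terms contribute the same factor to both beads, any difference between the two scores can only come from the English word frequencies, never from whether the sentences are translations.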

To capture the dependence between individual English words and individual French
words, we assign probabilities to word pairs in addition to just single words. For two words
$e$ and $f$ that are mutual translations, instead of having the two terms $p_e(e)$ and $p_f(f)$ in
the above equation we would like a single term $p(e, f)$ that is substantially larger than
$p_e(e)\, p_f(f)$. To this end, we introduce the concept of a word bead. A word bead is either
a single English word, a single French word, or a single English word and a single French
word. We refer to these as 1:0, 0:1, and 1:1 word beads, respectively. Instead of modeling a
pair of sentences as a list of independent words, we model sentences as a list of word beads,
using the 1:1 word beads to capture the dependence between English and French words. To
address the issue that corresponding English and French words may not occur in identical
positions in their respective sentences, we abstract over the positions of words in sentences;
we consider sentences to be unordered multisets[6] of words, so that 1:1 word beads can pair
words from arbitrary positions in sentences.

5 This model can be considered somewhat similar to the alignment model used by Brown et al. As the
probabilities of words are taken to be independent, these probabilities do not depend on sentence alignment
and can be ignored, as in Brown. However, unlike Brown we also take sentence lengths to be independent;
a model closer to Brown would express sentence lengths as a joint probability $p_{\text{len}}(l(E), l(F))$ that assigned
higher probabilities to sentence pairs with similar lengths.

As a first cut at this behavior, consider the following "model":

$$p''(b) = p_{\text{b-len}}(l(b)) \prod_{i=1}^{l(b)} p_b(b_i)$$

where $b = \{b_1, \ldots, b_{l(b)}\}$ is a multiset of word beads, $p_{\text{b-len}}(l)$ is the probability that an
English sentence and a French sentence contain $l$ word beads, and $p_b(b_i)$ denotes the frequency
of the word bead $b_i$. This simple model captures lexical dependencies between English and
French sentences. For example, we might express the sentence bead

[do you speak French, parlez-vous français]

with the word beading

$b = \{[\text{do}], [\text{you}, \text{vous}], [\text{speak}, \text{parlez}], [\text{French}, \text{français}]\}$

with probability

$$p''(b) = p_{\text{b-len}}(4)\, p_b([\text{do}])\, p_b([\text{you}, \text{vous}])\, p_b([\text{speak}, \text{parlez}])\, p_b([\text{French}, \text{français}])$$

If $p_b([e, f]) \gg p_b([e])\, p_b([f])$ for words $e$ and $f$ that are mutual translations, then word
beadings of sentence pairs that contain many words that are mutual translations can have
much higher probability than word beadings of unrelated sentence pairs.
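
The effect of 1:1 word beads on $p''(b)$ can be checked numerically with a short sketch. The representation of a word bead as a tuple, the function name `beading_score`, and all probabilities are assumptions made up for this example.

```python
def beading_score(beading, p_b_len, p_b):
    """p''(b): the bead-count term times the product of word-bead frequencies."""
    score = p_b_len[len(beading)]
    for bead in beading:
        score *= p_b[bead]
    return score

# Word beads are tuples (english_word, french_word); None marks a missing side.
# All probabilities below are invented for illustration.
p_b = {
    ("do", None): 0.01, ("you", None): 0.01, ("speak", None): 0.01, ("French", None): 0.01,
    (None, "parlez"): 0.01, (None, "vous"): 0.01, (None, "français"): 0.01,
    ("you", "vous"): 0.05, ("speak", "parlez"): 0.05, ("French", "français"): 0.05,
}
p_b_len = {4: 0.2, 7: 0.1}

paired = [("do", None), ("you", "vous"), ("speak", "parlez"), ("French", "français")]
unpaired = [("do", None), ("you", None), ("speak", None), ("French", None),
            (None, "parlez"), (None, "vous"), (None, "français")]
paired_score = beading_score(paired, p_b_len, p_b)
unpaired_score = beading_score(unpaired, p_b_len, p_b)
```

With these numbers the beading that pairs mutual translations scores many orders of magnitude higher than the beading that leaves every word in its own bead.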

However, this "model" $p''(b)$ does not satisfy the constraint that $\sum_b p''(b) = 1$. To see
this, consider the case that the ordering of beads is significant so that instead of having
a multiset $b = \{b_1, \ldots, b_l\}$ we have a list $b = (b_1, \ldots, b_l)$. For this case, we have
$\sum_b p''(b) = 1$ (as for equation (4.3)). Then, because our beadings $b$ are actually unordered,
multiple terms in this last sum that are just different orderings of the same multiset will
be collapsed to a single term in our actual summation. Hence, the true $\sum_b p''(b)$ will be
substantially less than one. To force this model to sum to one, we simply normalize to
retain the qualitative aspects of the model. We take

$$p(b) = \frac{p_{\text{b-len}}(l(b))}{N_{l(b)}} \prod_{i=1}^{l(b)} p_b(b_i)$$

where

$$N_l = \sum_{l(b)=l} \prod_{i=1}^{l} p_b(b_i) \qquad (4.4)$$
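
For a tiny bead vocabulary, the gap between the ordered and unordered sums can be verified by brute-force enumeration. The sketch below is illustrative only: the three-bead distribution and the function names `ordered_sum` and `normalizer` are invented, and enumeration of this kind is feasible only for very small vocabularies.

```python
from itertools import combinations_with_replacement, product
from math import prod

def ordered_sum(p_b, l):
    """Sum of bead-frequency products over all ordered lists of l word beads."""
    return sum(prod(p_b[x] for x in seq) for seq in product(p_b, repeat=l))

def normalizer(p_b, l):
    """N_l: the same sum taken over unordered multisets of l word beads."""
    return sum(prod(p_b[x] for x in ms) for ms in combinations_with_replacement(p_b, l))

# Invented three-bead distribution; the names a, b, c stand in for word beads.
p_b = {"a": 0.5, "b": 0.3, "c": 0.2}
total_ordered = ordered_sum(p_b, 2)
n_2 = normalizer(p_b, 2)
```

With two beads per beading the ordered sum is 1, while the multiset sum $N_2$ is 0.69, so dividing by $N_l$ is what restores normalization.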

6 A multiset is a set in which a given element can occur more than once.

To derive the probabilities of sentence beads $p([E, F])$ from the probabilities of word
beadings $p(b)$, we need to consider the issue of word ordering. A beading $b$ describes an
unordered multiset of English and French words, while sentences are ordered sequences of
words. We need to model word ordering, and ideally the probability of a sentence bead
should depend on the ordering of its component words. For example, the sentence John
ate Fido should have a higher probability of aligning with the sentence Jean a mangé Fido
than with the sentence Fido a mangé Jean. However, modeling how word order mutates
under translation is notoriously difficult (Brown et al., 1993), and it is unclear how much
improvement in accuracy an accurate model of word order would provide. Hence, we ignore
this issue and take all word orderings to be equiprobable. Let $O(E)$ denote the number of
distinct ways of ordering the words in a sentence $E$.[7] Then, we take

$$p([E, F] \mid b) = \frac{1}{O(E)\, O(F)} \qquad (4.5)$$

for those sentence beads $[E, F]$ consistent with $b$, i.e., those sentence beads containing the
same words as $b$. (For inconsistent beads $[E, F]$, we have $p([E, F] \mid b) = 0$.)
This gives us

$$p([E, F], b) = p(b)\, p([E, F] \mid b) = \frac{p_{\text{b-len}}(l(b))}{N_{l(b)}\, O(E)\, O(F)} \prod_{i=1}^{l(b)} p_b(b_i)$$

To get the total probability $p([E, F])$ of a sentence bead, we need to sum over all beadings
$b$ consistent with $[E, F]$, giving us

$$p([E, F]) = \sum_{b \sim [E,F]} p([E, F], b) = \sum_{b \sim [E,F]} \frac{p_{\text{b-len}}(l(b))}{N_{l(b)}\, O(E)\, O(F)} \prod_{i=1}^{l(b)} p_b(b_i) \qquad (4.6)$$

where $b \sim [E, F]$ denotes $b$ being consistent with $[E, F]$.
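
The ordering count $O(E)$ is a multinomial coefficient: the factorial of the sentence length divided by the factorial of each word's multiplicity. A minimal sketch follows, assuming whitespace tokenization; the function name `num_orderings` is hypothetical.

```python
from collections import Counter
from math import factorial

def num_orderings(sentence):
    """O(E): the number of distinct orderings of the words in a sentence."""
    words = sentence.split()
    count = factorial(len(words))
    # Divide out permutations that only swap identical words.
    for multiplicity in Counter(words).values():
        count //= factorial(multiplicity)
    return count

distinct = num_orderings("John ate Fido")         # all words distinct: 3! orderings
repeated = num_orderings("that is what that is")  # two words occur twice each
```

For the second sentence the count is $5! / (2!\, 2!\, 1!) = 30$ rather than $5! = 120$.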



4.2.3 The Complete Translation Model 

In this section, we extend the translation model to other types of sentence beads besides 
beads that contain a single English and French sentence. Like Brown et al., we only consider 
sentence beads consisting of one English sentence, one French sentence, one English sentence 
and one French sentence, two English sentences and one French sentence, and one English 
sentence and two French sentences. We refer to these as 1:0, 0:1, 1:1, 2:1, and 1:2 sentence 
beads, respectively. 

7 For a sentence $E$ containing words that are all distinct, we just have $O(E) = l(E)!$. More generally,
we have that $O(E) = \binom{l(E)}{\{m(e_i)\}}$ where $\{m(e_i)\}$ denotes the multiplicities of the different words $e_i$ in the
sentence.

For 1:1 sentence beads, we take

$$p([E, F]) = p(1{:}1) \sum_{b \sim [E,F]} \frac{p_{1:1}(l(b))}{N_{l(b),1:1}\, O(E)\, O(F)} \prod_{i=1}^{l(b)} p_b(b_i) \qquad (4.7)$$

This differs from equation (4.6) in that we have added the term $p(1{:}1)$ representing the
probability or frequency of 1:1 sentence beads; this term is necessary for normalization
purposes now that we consider other types of sentence beads. In addition, we now refer
to $p_{\text{b-len}}$ as $p_{1:1}$ and to $N_l$ as $N_{l,1:1}$ as there will be analogous terms for the other types of
sentence beads.

To model 1:0 sentence beads, we use a similar equation except that we only need to
consider 1:0 word beads (i.e., individual English words), and we do not need to sum over
beadings since there is only one word beading consistent with a 1:0 sentence bead. We take

$$p([E]) = p(1{:}0)\, \frac{p_{1:0}(l(E))}{N_{l(E),1:0}\, O(E)} \prod_{i=1}^{l(E)} p_e(e_i) \qquad (4.8)$$

Instead of the distribution $p_b(b_i)$ of word bead frequencies, we have the distribution $p_e(e_i)$
of English word frequencies. We use an analogous equation for 0:1 sentence beads.
For 2:1 sentence beads, we take

$$p([E_i^{i+1}, F]) = p(2{:}1) \sum_{b \sim [E_i^{i+1}, F]} \frac{p_{2:1}(l(b))}{N_{l(b),2:1}\, O(E_i)\, O(E_{i+1})\, O(F)} \prod_{j=1}^{l(b)} p_b(b_j) \qquad (4.9)$$

We use an analogous equation for 1:2 sentence beads.

4.3 Implementation 

In this section, we describe our implementation of the alignment algorithm. We describe 
how we train the parameters or probabilities associated with the translation model, and how 
we perform the search for the best alignment. In addition, we describe approximations that 
we use to make the algorithm computationally tractable. We hypothesize that because our
translation model incorporates lexical information strongly, correct alignments are tremendously
more probable than incorrect alignments so that moderate errors in calculation will
not greatly affect results.

8 To be more consistent with 1:1 sentence beads, in equation (4.9) instead of the expression $O(E_i)\, O(E_{i+1})$
we should have the expression $O(E_i^{i+1})(l + 1)$ where $l = l(E_i) + l(E_{i+1})$. There are $O(E_i^{i+1})$ different ways
to order the $l$ English words in $b$, and there are $l + 1$ different places to divide the list of $l$ English words
into two sentences (assuming we allow sentences of length zero). Thus, instead of equation (4.5) as for 1:1
sentence beads, we have

$$p([E_i^{i+1}, F] \mid b) = \frac{1}{O(E_i^{i+1})\, O(F)\, (l + 1)}$$

if we take all of these possibilities to be equiprobable. However, we choose the expression $O(E_i)\, O(E_{i+1})$
because then the contribution of this word ordering factor is independent of alignment and can be ignored.
Mathematically, we account for this choice through the normalization constants $N_{l,2:1}$; see Section 4.3.2.



4.3.1 Evaluating the Probability of a Sentence Bead 

The probability of a 0:1 or 1:0 sentence bead can be calculated efficiently using equation (4.8)
in Section 4.2.3. To evaluate the probabilities of other types of sentence beads exactly
requires a sum over a vast number of possible word beadings. We make the gross approximation
that this sum is roughly equal to the maximum term in the sum. For example, for
1:1 sentence beads we have

$$p([E, F]) \approx p(1{:}1) \max_{b \sim [E,F]} \left\{ \frac{p_{1:1}(l(b))}{N_{l(b),1:1}\, O(E)\, O(F)} \prod_{i=1}^{l(b)} p_b(b_i) \right\}$$

Even with this approximation, the calculation of p{[E,F]) is still expensive since it 
requires a search for the most probable beading. We use a greedy heuristic to perform this 
search; the heuristic is not guaranteed to find the most probable beading. We begin with 
every word in its own bead. We then find the 0:1 bead and 1:0 bead that, when replaced 
with a 1:1 word bead, results in the greatest increase in the probability of the beading. We 
repeat this process until we can no longer find a 0:1 and 1:0 bead pair that when replaced 
would increase the beading's probability. 

For example, consider the sentence bead 

[do you speak French, parlez-vous français]

To search for the most probable word beading of this sentence bead, we begin with each
word in its own word bead:

b = {[do], [you], [speak], [French], [parlez], [vous], [français]}

Then, we find the pair of word beads that when replaced with a 1:1 bead results in the
largest increase in p(b); suppose this pair is [French] and [français]. We substitute in the
1:1 bead yielding

b = {[do], [you], [speak], [French, français], [parlez], [vous]}

We repeat this process until there are no more pairs that are profitable to replace; this is
apt to yield a beading such as

b = {[do], [you, vous], [speak, parlez], [French, français]}

This greedy search for the most probable beading can be performed in time roughly linear
in the number of words in the involved sentences, as long as the probability distribution
$p_b([e,f])$ is fairly sparse, that is, as long as for most words $e$, there are few words $f$ such
that $p_b([e,f]) > 0$. To perform this search efficiently, for each English word $e$ we maintain
a list of all French words $f$ such that $p_b([e,f]) > 0$. In addition, we maintain a list for each
French word storing information about the current sentence. (For expository purposes, we
assume the sentence bead contains a single English and French sentence.) Then, for a given
sentence bead we do as follows:

• For each word $e$ in the English sentence, we take all associated beads $[e,f]$ with
$p_b([e,f]) > 0$ and append these beads to the list associated with the word $f$.

• We take the lists associated with each word $f$ in the French sentence, and merge them
into a single list. We sort the resulting list according to the increase in probability
associated with each bead if substituted into the beading.

With this procedure, we can find all applicable beads $[e,f]$ with nonzero probability in
near-linear time in sentence length. With the sorted list of beads, performing the greedy
search efficiently is fairly straightforward.
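The procedure above can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the names `index_pairs` and `greedy_beading`, the representation of beads as `(e, f)` tuples with `None` for an empty side, and the toy probabilities in the usage note are all assumptions. The gain of a substitution is taken to be the change in the log of the product of word bead probabilities only; the length-dependent factors of the full model are ignored for simplicity.

```python
import math
from collections import defaultdict

def index_pairs(p_b):
    """Map each English word e to the French words f with p_b[(e, f)] > 0."""
    by_english = defaultdict(list)
    for (e, f), p in p_b.items():
        if e is not None and f is not None and p > 0.0:
            by_english[e].append((f, p))
    return by_english

def greedy_beading(english, french, p_b, by_english):
    """Return a word beading as a list of (e, f) tuples; None marks an empty side."""
    positions = defaultdict(list)  # French word -> its positions in the sentence
    for j, f in enumerate(french):
        positions[f].append(j)

    # Collect every applicable 1:1 bead with the log-probability gained by
    # substituting it for the corresponding 1:0 and 0:1 beads.
    candidates = []
    for i, e in enumerate(english):
        for f, p in by_english.get(e, []):
            for j in positions.get(f, []):
                gain = (math.log(p) - math.log(p_b[(e, None)])
                        - math.log(p_b[(None, f)]))
                if gain > 0.0:
                    candidates.append((gain, i, j))
    candidates.sort(reverse=True)  # most profitable substitution first

    used_e, used_f, beads = set(), set(), []
    for _, i, j in candidates:
        if i not in used_e and j not in used_f:
            used_e.add(i)
            used_f.add(j)
            beads.append((english[i], french[j]))
    # Words that were never paired stay in their own 1:0 or 0:1 beads.
    beads += [(e, None) for i, e in enumerate(english) if i not in used_e]
    beads += [(None, f) for j, f in enumerate(french) if j not in used_f]
    return beads
```

With a toy table in which [French, français] is the most probable 1:1 bead, calling `greedy_beading(["do", "you", "speak", "French"], ["parlez", "vous", "français"], p_b, index_pairs(p_b))` pairs French with français first and leaves [do] as a 1:0 bead, mirroring the example in the text. Because the simplified gains do not depend on the other beads, a single pass over the sorted candidate list is equivalent to repeating the greedy step.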

4.3.2 Normalization 

The exact evaluation of the normalization constants $N_l$ is very expensive. For example, for
1:0 sentence beads we have that

$$N_l^{1:0} = \sum_{\bar{e} = \{e_1, \ldots, e_l\}} \prod_{i=1}^{l} p_e(e_i)$$

This is identical to the normalization for 1:1 sentence beads given in equation (1.4), except
that we restrict word beads to be single English words as 1:0 sentence beads only contain
English. It is impractical to sum over all sets of words $\{e_1, \ldots, e_l\}$. Furthermore, we
continually re-estimate the parameters $p_e(e_i)$ during the alignment process, so the exact value
of $N_l$ is constantly changing. Hence, we only approximate the normalization constants
$N_l$.

Let us first consider the constants $N_l^{1:0}$. Notice that when we sum over ordered lists $\vec{e}$
instead of unordered sets $\bar{e}$, we have the relation

$$\sum_{\vec{e} = (e_1, \ldots, e_l)} \prod_{i=1}^{l} p_e(e_i) = 1$$

Let $O(\bar{e})$ be the number of distinct orderings of the elements in the (multi-)set
$\bar{e} = \{e_1, \ldots, e_l\}$; this is equal to the number of different lists $\vec{e}$ that can be formed
using all of the words in $\bar{e}$. Then, we have

$$\sum_{\bar{e} = \{e_1, \ldots, e_l\}} O(\bar{e}) \prod_{i=1}^{l} p_e(e_i) = 1$$

We make the approximation that $O(\bar{e}) = l!$ for all $\bar{e}$. This approximation is exact for all
sets $\bar{e}$ containing no duplicate words. This gives us

$$N_l^{1:0} = \sum_{\bar{e} = \{e_1, \ldots, e_l\}} \prod_{i=1}^{l} p_e(e_i)
= \frac{1}{l!} \sum_{\bar{e} = \{e_1, \ldots, e_l\}} l! \prod_{i=1}^{l} p_e(e_i)
\approx \frac{1}{l!} \sum_{\bar{e} = \{e_1, \ldots, e_l\}} O(\bar{e}) \prod_{i=1}^{l} p_e(e_i)
= \frac{1}{l!}$$

We use analogous approximations for $N_l^{0:1}$ and $N_l^{1:1}$.
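To see how tight the $1/l!$ approximation is, the sums above can be evaluated exactly on a small vocabulary. The sketch below is illustrative only; the function names and the uniform 50-word vocabulary are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

def orderings(multiset):
    """O(e): number of distinct orderings of a multiset of words."""
    n = math.factorial(len(multiset))
    for c in Counter(multiset).values():
        n //= math.factorial(c)
    return n

def norm_sums(p_e, l):
    """Return (sum over multisets of prod p_e, the same sum weighted by O(e))."""
    plain = weighted = 0.0
    for ms in combinations_with_replacement(sorted(p_e), l):
        prod = math.prod(p_e[w] for w in ms)
        plain += prod
        weighted += orderings(ms) * prod
    return plain, weighted

p_e = {f"w{i}": 1 / 50 for i in range(50)}  # hypothetical uniform vocabulary
plain, weighted = norm_sums(p_e, 3)
print(plain, weighted, 1 / math.factorial(3))
```

For this toy vocabulary the weighted sum is 1 (up to rounding), while the exact unweighted sum is about 0.177 against the approximation $1/3! \approx 0.167$; the gap comes entirely from multisets with repeated words, and shrinks as the vocabulary grows.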

Now, let us consider the constants $N_l^{2:1}$. These need to be calculated differently from
the above constants. To show this, we contrast the 2:1 sentence bead case with the 1:1
sentence bead case given in Section 4.2.2. For 1:1 sentence beads, we have

$$p([E,F] \mid b) = \frac{1}{O(E)\,O(F)}$$

and this results in the expression $O(E)O(F)$ in equation (4.7). The analogous expression
for 2:1 sentence beads in equation (4.9) is $O(E_i)O(E_{i+1})O(F)$. However, the distribution

$$p([E_i^{i+1}, F] \mid b) = \frac{1}{O(E_i)\,O(E_{i+1})\,O(F)}$$

is not proper in that $\sum_{E_i^{i+1}, F} p([E_i^{i+1}, F] \mid b) \neq 1$; as described in Section 4.2.3, the
expression $O(E_i)O(E_{i+1})O(F)$ is not equal to the number of different 2:1 sentence beads
corresponding to $b$. Thus, the derivation for 1:1 sentence beads does not hold for 2:1 sentence
beads and we need to calculate the normalization constants differently.

Instead, we choose $N_l^{2:1}$ so that the probabilities $p([E_i^{i+1}, F], b)$ sum correctly. In
particular, we want to choose $N_l^{2:1}$ such that

$$\sum_{E_i^{i+1},\, F,\, l(b)=l} p([E_i^{i+1}, F], b) = p(2{:}1)\, p_{2:1}(l)$$

as $p(2{:}1)\,p_{2:1}(l)$ is the amount of probability allocated by the model for word beadings of
length $l$ of 2:1 sentence beads. Substituting in equation (4.9), we get

$$\sum_{b \sim [E_i^{i+1}, F],\, l(b)=l} \frac{p(2{:}1)\, p_{2:1}(l)}{N_l^{2:1}\, O(E_i)\,O(E_{i+1})\,O(F)} \prod_{i=1}^{l} p_b(b_i) = p(2{:}1)\, p_{2:1}(l)$$

Rearranging, we get

$$N_l^{2:1} = \sum_{b \sim [E_i^{i+1}, F],\, l(b)=l} \frac{1}{O(E_i)\,O(E_{i+1})\,O(F)} \prod_{i=1}^{l} p_b(b_i)$$

Now, we re-express the sum over $E_i$, $E_{i+1}$, and $F$ as a sum over unordered sets
$\bar{e}_i = \{e_1, \ldots, e_{l(E_i)}\}$, $\bar{e}_{i+1} = \{e'_1, \ldots, e'_{l(E_{i+1})}\}$, and $\bar{f} = \{f_1, \ldots, f_{l(F)}\}$, giving us

$$N_l^{2:1} = \sum_{b \sim [\bar{e}_i, \bar{e}_{i+1}, \bar{f}],\, l(b)=l} \frac{O(\bar{e}_i)\,O(\bar{e}_{i+1})\,O(\bar{f})}{O(E_i)\,O(E_{i+1})\,O(F)} \prod_{i=1}^{l} p_b(b_i)
= \sum_{b \sim [\bar{e}_i, \bar{e}_{i+1}, \bar{f}],\, l(b)=l} \prod_{i=1}^{l} p_b(b_i)$$

Then, let us consider how many different $[\bar{e}_i, \bar{e}_{i+1}, \bar{f}]$ are consistent with a given word beading
$b$. There is only a single way to allocate the French words in $b$ to get $\bar{f}$; however, there
are many ways of dividing the English words in $b$ to get $\bar{e}_i$ and $\bar{e}_{i+1}$. In particular, there
are $2^{n_e(b)}$ ways of doing this, where $n_e(b)$ denotes the number of English words in $b$. Each
of the $n_e(b)$ English words in $b$ can be placed in either of the two English sets. Using this
observation, we have that

$$N_l^{2:1} = \sum_{l(b)=l} 2^{n_e(b)} \prod_{i=1}^{l} p_b(b_i)$$

To evaluate this equation, we use several approximations. The first approximation we
make is that $p_b(b_i)$ is a uniform distribution, i.e., $p_b(b) = \frac{1}{B}$ for all $b$ where $B$ is the total
number of different word beads. This gives us

$$N_l^{2:1} \approx \sum_{l(b)=l} 2^{n_e(b)} \prod_{i=1}^{l} \frac{1}{B} = \frac{1}{B^l} \sum_{l(b)=l} 2^{n_e(b)}$$

Then, let $b_l(n)$ be the number of bead sets $b$ of size $l$ containing exactly $n$ English words.
We can rewrite the preceding sum as

$$N_l^{2:1} \approx \frac{1}{B^l} \sum_{n=0}^{l} b_l(n)\, 2^n \qquad (4.10)$$

To approximate $b_l(n)$, let $B_{+e}$ be the number of different word beads containing English
words (i.e., 1:0 and 1:1 beads), and let $B_{-e}$ be the number of different word beads not
containing English words (i.e., 0:1 beads), so that $B = B_{+e} + B_{-e}$. Notice that the number
of bead lists (as opposed to sets) $b'_l(n)$ of length $l$ containing $n$ English words is

$$b'_l(n) = \binom{l}{n} B_{+e}^{\,n}\, B_{-e}^{\,l-n}$$

To estimate the number of bead sets $b_l(n)$, we use the same approximation used in calculating
$N_l^{1:0}$ and simply divide by $l!$. This gives us

$$b_l(n) \approx \frac{1}{l!} \binom{l}{n} B_{+e}^{\,n}\, B_{-e}^{\,l-n}$$

Substituting this into equation (4.10), we get

$$N_l^{2:1} \approx \frac{1}{B^l\, l!} \sum_{n=0}^{l} \binom{l}{n} B_{+e}^{\,n}\, B_{-e}^{\,l-n}\, 2^n
= \frac{1}{B^l\, l!} \sum_{n=0}^{l} \binom{l}{n} (2B_{+e})^n\, B_{-e}^{\,l-n}$$

Using the binomial identity $(x+y)^l = \sum_{n=0}^{l} \binom{l}{n} x^n y^{l-n}$, we get

$$N_l^{2:1} \approx \frac{1}{B^l\, l!} (2B_{+e} + B_{-e})^l
= \frac{1}{l!} \left( \frac{2B_{+e} + B_{-e}}{B_{+e} + B_{-e}} \right)^{l}
= \frac{1}{l!} \left( 1 + \frac{B_{+e}}{B} \right)^{l}$$

We use an analogous approximation for $N_l^{1:2}$.

4.3.3 Parameterization 

To model the parameters $p_{A\text{-len}}(L)$ in equation (4.3) representing the probability that a
bilingual corpus is $L$ sentence beads in length, we assume a uniform distribution;⁹ it is
unclear what a priori information we have on the length of a corpus. This allows us to
ignore the term, since this length will not affect the probability of an alignment.

We model sentence length (in beads) using a Poisson distribution, i.e.,

$$p_{1:0}(l) = \frac{\lambda_{1:0}^{\,l}}{l!}\, e^{-\lambda_{1:0}} \qquad (4.11)$$

for some $\lambda_{1:0}$, and we have analogous equations for the other types of sentence beads. To
prevent the possibility of some of the $\lambda$'s being assigned unnaturally small or large values
during the training process to specifically model very short or very long sentences, we tie
together the $\lambda$ values for the different types of sentence beads. We take

$$\lambda_{1:0} = \lambda_{0:1} = \frac{\lambda_{1:1}}{2} = \frac{\lambda_{2:1}}{3} = \frac{\lambda_{1:2}}{3} \qquad (4.12)$$
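As an illustration of the tied length model, the following sketch computes the Poisson length probability for each sentence bead type from a single free parameter. It assumes that equation (4.12) ties each $\lambda$ to $\lambda_{1:0}$ in proportion to the number of sentences in the bead (factors 1, 1, 2, 3, 3); the names `SENTENCES_PER_BEAD`, `tied_lambdas`, and `length_prob` are hypothetical.

```python
import math

# Assumed reading of equation (4.12): lambda scales with the number of
# sentences contained in the sentence bead.
SENTENCES_PER_BEAD = {"1:0": 1, "0:1": 1, "1:1": 2, "2:1": 3, "1:2": 3}

def tied_lambdas(lambda_10):
    """Derive every bead type's lambda from the single parameter lambda_{1:0}."""
    return {t: n * lambda_10 for t, n in SENTENCES_PER_BEAD.items()}

def length_prob(bead_type, l, lambda_10):
    """Poisson probability that a sentence bead of this type has l word beads."""
    lam = tied_lambdas(lambda_10)[bead_type]
    return math.exp(-lam) * lam ** l / math.factorial(l)
```

Tying the parameters this way leaves only one length parameter to estimate, so a handful of unusually short or long beads of one type cannot pull that type's $\lambda$ to an extreme value.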

In modeling the frequency of word beads, there are three distinct distributions we need
to model: the distribution $p_e(e_i)$ of 1:0 word beads in 1:0 sentence beads, the distribution
$p_f(f_i)$ of 0:1 word beads in 0:1 sentence beads, and the distribution of all word beads $p_b(b_i)$
in 1:1, 2:1, and 1:2 sentence beads.¹⁰ To reduce the number of independent parameters
we need to estimate, we tie these distributions together. We take $p_e(e_i)$ and $p_f(f_i)$ to be
identical to $p_b(b_i)$, except restricted to the relevant subset of word beads and normalized
appropriately, i.e.,

$$p_e(e) = \frac{p_b([e])}{\sum_{e} p_b([e])}$$

and

$$p_f(f) = \frac{p_b([f])}{\sum_{f} p_b([f])}$$

where $[e]$ and $[f]$ denote 1:0 and 0:1 word beads, respectively.
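The tying of $p_e$ and $p_f$ to $p_b$ amounts to restricting a table to a subset of beads and renormalizing. A minimal sketch, assuming word beads are stored as `(e, f)` tuples with `None` for an empty side (a representation chosen for this example, not taken from the thesis):

```python
def restrict(p_b, keep):
    """Restrict p_b to the beads accepted by keep() and renormalize."""
    total = sum(p for bead, p in p_b.items() if keep(bead))
    return {bead: p / total for bead, p in p_b.items() if keep(bead)}

def english_only(bead):
    """True for 1:0 word beads."""
    e, f = bead
    return e is not None and f is None

def french_only(bead):
    """True for 0:1 word beads."""
    e, f = bead
    return e is None and f is not None
```

For example, `restrict(p_b, english_only)` plays the role of $p_e$ and `restrict(p_b, french_only)` the role of $p_f$; both are recomputed from the same underlying counts, so no separate parameters need to be estimated for them.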

To further reduce the number of parameters, we convert all words to lowercase. For 
example, we consider the words May and may to be identical. 



9 To be precise, we assume a uniform distribution over some arbitrarily large finite range, as one cannot 
have a uniform distribution over a countably infinite set. 

10 Conceivably, we could consider using three different distributions $p_b(b_i)$ for 1:1, 2:1, and 1:2 sentence
beads. However, we assume these distributions are identical to reduce the number of parameters.

4.3.4 Parameter Estimation Framework



The basic method we use for estimating the parameters or probabilities of our model is 
to just take counts on previously aligned data and to normalize. For example, to estimate 
$p(1{:}0)$, the probability or frequency of 1:0 sentence beads, we count the total number of 1:0
sentence beads in previously aligned data and divide by the total number of sentence beads. 
To bootstrap the model, we first take counts on a small amount of data that has been aligned 
by hand or by some other algorithm. Once the model has been bootstrapped, it can align 
sentences by itself, and we can take counts on the data already aligned by the algorithm to 
improve the parameter estimates for aligning future data. For the Hansard corpus, we have 
found that one hundred sentence pairs are sufficient to bootstrap the alignment model. 

This method can be considered to be a variation of the Viterbi version of the expectation-
maximization (EM) algorithm (Dempster et al., 1977). In the EM algorithm, an expectation
phase, where counts on the corpus are taken using the current estimates of the parameters, 
is alternated with a maximization phase, where parameters are re-estimated based on the 
counts just taken. Improved parameters lead to improved counts, which lead to even more 
accurate parameters. In the incremental version of the EM algorithm we use, instead 
of re-estimating parameters after each complete pass through the corpus, we re-estimate 
parameters after each sentence. By re-estimating parameters continually as we take counts 
on the corpus, we can align later sections of the corpus more reliably based on the alignment 
of earlier sections. We can align a corpus with only a single pass, simultaneously producing 
alignments and updating the model as we proceed. 
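The incremental count-and-normalize scheme can be sketched for the simplest parameters, the sentence bead type probabilities. The class and function names below are hypothetical, and the stream of bead types stands in for the output of the aligner; starting every count at 1 follows the initialization described in Section 4.3.5.

```python
from collections import Counter

BEAD_TYPES = ("1:0", "0:1", "1:1", "2:1", "1:2")

class BeadTypeModel:
    """Sentence bead type probabilities estimated as normalized counts."""

    def __init__(self):
        # Every count starts at 1 so that no probability is ever zero.
        self.counts = Counter({t: 1 for t in BEAD_TYPES})

    def prob(self, bead_type):
        return self.counts[bead_type] / sum(self.counts.values())

    def update(self, bead_type):
        self.counts[bead_type] += 1

def align_incrementally(bead_stream, model):
    """Score each bead with the current estimate, then re-estimate at once."""
    history = []
    for bead_type in bead_stream:
        history.append(model.prob(bead_type))  # estimate in effect for this bead
        model.update(bead_type)                # parameters change before the next bead
    return history
```

Feeding three 1:1 beads through `align_incrementally` yields the successive estimates 1/5, 2/6, and 3/7 for `prob("1:1")`, which shows how later parts of the corpus are scored with parameters already shaped by earlier parts.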

However, to align a corpus in a single pass, the model must be fairly accurate before 
starting or else the beginning of the corpus will be poorly aligned. Hence, after bootstrap- 
ping the translation model on one hundred sentence pairs and before starting the one pass 
through the entire corpus to produce our final alignment, we first refine the translation 
model by using the algorithm to align a chunk of the unaligned target bilingual corpus. In 
experiments with the Hansard corpus, we train on 20,000 unaligned sentence pairs before 
performing the final alignment. 

Because the search algorithm considers many partial alignments simultaneously, it is not 
obvious how to determine when it is certain that a particular sentence bead will be part of 
the final alignment and thus can be trained on. To elaborate, our search algorithm maintains 
a set of partial alignments, each partial alignment representing a possible alignment between 
some prefix of the English corpus and some prefix of the French corpus. These hypothesis 
alignments are extended incrementally during the search process. To address the problem 
of determining which sentence beads can be trained on, we keep track of the longest partial 
alignment common to all partial alignments currently being considered. It is assured that 
this common partial alignment will be part of the final alignment. Hence, whenever a 
sentence bead is added to this common alignment we use it to train on. The point in the 
corpus at the end of this common alignment is called the confluence point.

4.3.5 Parameter Estimation Details



One issue with the framework described in the last section is that in a straightforward 
implementation, probabilities that are initialized to zero will remain zero during the training 
process. An object with zero probability will never occur in an alignment, and thus it will 
never receive any counts. The probability of an object is just its count normalized, so 
such an object will always have probability zero. Thus, it is important to initialize all 
probabilities to nonzero values. Unless otherwise specified, we achieve this by setting all 
counts initially to 1. 

We now describe in detail how we estimate specific parameters. To estimate word bead
frequencies $p_b(b)$, we maintain a count $c(b)$ for each word bead $b$ reflecting the number of
times that word bead has occurred in the most probable word beading of a sentence bead.
More specifically, given some aligned data to train on, we first find the most probable word
beading of each sentence bead in the alignment using the current model, and we then use
these most probable word beadings to update word bead counts. We take

$$p_b(b) = \frac{c(b)}{\sum_{b} c(b)}$$



For 0:1 and 1:0 word beads, we initialize the counts $c(b)$ to 1. For 1:1 word beads, we
initialize these counts to zero; this is because our algorithm for searching for the most
probable word beading of a sentence bead will not be efficient unless $p_b([e,f])$ is sparse, as
described in Section 4.3.1. Instead, we use a heuristic for initializing particular $c([e,f])$ to
nonzero values during the training process; whenever we see a 0:1 and a 1:0 word bead occur
in the most probable beading of a sentence bead, we initialize the count of the corresponding
1:1 word bead to a small value.¹¹ This heuristic is effective for constraining the number of
1:1 word beads with nonzero probability.

To estimate the sentence length parameters $\lambda$, we divide the number of word beads
in the most probable beadings of the previously aligned sentences by the total number
of sentences. This gives us the mean number of word beads per sentence. In a Poisson
distribution, the mean coincides with the value of the $\lambda$ parameter. (Recall that we model
sentence length with a Poisson distribution as in equation (4.11).) We take $\lambda_{1:0}$ to be this
mean value, and the other $\lambda$ parameters can be calculated using equation (4.12). For the
situation before we have any counts, we set $\lambda_{1:0}$ to an arbitrary constant; we chose the value
7.

To estimate the probabilities $p(1{:}0)$, $p(0{:}1)$, $p(1{:}1)$, $p(2{:}1)$, and $p(1{:}2)$ of each
type of sentence bead, we count the number of times each type of bead occurred in the
previously aligned data and divide by the total number of sentence beads. These counts
are initialized to 1.



11 The particular value we use is , where $n_e$ denotes the number of 1:0 word beads in the beading,
and $n_f$ denotes the number of 0:1 word beads. This can be thought of as dividing counts evenly
among all $n_e n_f$ bead pairs.






4.3.6 Cognates 

There are many words that possess the same spelling in two different languages. For example,
punctuation, numbers, and proper names generally have the same spellings in English
and French. Such words are members of a class called cognates (Simard et al., 1992).
Because identically spelled words can be recognized automatically and are frequently
translations of each other, it is sensible to use this a priori information in initializing word bead
frequencies. To this end, we initialize to 1 the count of all 1:1 word beads that contain
words that are spelled identically.
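A minimal sketch of this initialization is given below. The function name and the dictionary-of-counts representation are assumptions; lowercasing follows the convention stated in Section 4.3.3.

```python
def seed_cognates(english_words, french_words, counts=None):
    """Give a count of 1 to every 1:1 word bead whose two words are spelled identically."""
    counts = {} if counts is None else counts
    shared = {w.lower() for w in english_words} & {w.lower() for w in french_words}
    for w in shared:
        counts.setdefault((w, w), 1.0)  # leave any existing count untouched
    return counts
```

For instance, `seed_cognates(["Ottawa", "1867", ",", "house"], ["ottawa", "1867", ",", "chambre"])` seeds the beads for "ottawa", "1867", and the comma, while "house" and "chambre" receive no 1:1 count from this step.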

4.3.7 Search 

It is natural to use dynamic programming to search for the best alignment; one can find the 
most probable of an exponential number of alignments using quadratic time and memory. 
Using the same perspective as Gale and Church (1993), we view alignment as a "shortest
path" problem. Recall that we try to find the alignment $\hat{A}$ such that

$$\hat{A} = \arg\max_{A}\, p(A, E, F) = \arg\max_{A}\; p_{A\text{-len}}(l(A)) \prod_{i=1}^{l(A)} p([E_{A_i^e}^{A_{i+1}^e - 1}, F_{A_i^f}^{A_{i+1}^f - 1}])$$

Manipulating this equation, we get

$$\begin{aligned}
\hat{A} &= \arg\max_{A}\, p(A, E, F) \\
&= \arg\min_{A}\, -\ln p(A, E, F) \\
&= \arg\min_{A}\, -\ln \Big\{ p_{A\text{-len}}(l(A)) \prod_{i=1}^{l(A)} p([E_{A_i^e}^{A_{i+1}^e - 1}, F_{A_i^f}^{A_{i+1}^f - 1}]) \Big\} \\
&= \arg\min_{A}\, \Big\{ -\ln p_{A\text{-len}}(l(A)) + \sum_{i=1}^{l(A)} -\ln p([E_{A_i^e}^{A_{i+1}^e - 1}, F_{A_i^f}^{A_{i+1}^f - 1}]) \Big\} \\
&= \arg\min_{A}\, \sum_{i=1}^{l(A)} -\ln p([E_{A_i^e}^{A_{i+1}^e - 1}, F_{A_i^f}^{A_{i+1}^f - 1}])
\end{aligned}$$

(Recall that we take $p_{A\text{-len}}(l)$ to be a uniform distribution.) In other words, finding the
alignment with highest probability is equivalent to finding the alignment that minimizes
the sum of the negative logarithms of the probabilities of the sentence beads that compose
the alignment.¹² Thus, by assigning to each sentence bead a "length" equal to the negative
logarithm of its probability, the sentence alignment problem is reduced to finding the
shortest path from the beginning of a corpus to its end.

12 This transformation is equivalent to the transformation between probability space and length space
given in Section 3.2.3. (Recall that $-\ln p = \ln \frac{1}{p}$.)

The shortest path problem has a well-known dynamic programming solution. We evaluate
a lattice $D(i,j)$ representing the shortest distance from the beginning of the corpus
to the $i$th English sentence and $j$th French sentence. The distance $D(i,j)$ is equal to the
length of the most probable alignment aligning the first $i$ and $j$ sentences of the English
and French corpora, respectively. This lattice can be calculated efficiently using a simple
recurrence relation; in this case, the recurrence relation is:

$$D(i,j) = \min \begin{cases}
D(i-1,\, j) - \ln p([E_i]) \\
D(i,\, j-1) - \ln p([F_j]) \\
D(i-1,\, j-1) - \ln p([E_i, F_j]) \\
D(i-2,\, j-1) - \ln p([E_{i-1}^{i}, F_j]) \\
D(i-1,\, j-2) - \ln p([E_i, F_{j-1}^{j}])
\end{cases} \qquad (4.13)$$



In other words, the most probable alignment of the first i and j English and French sentences 
can be expressed in terms of the most probable alignment of some prefix of these sentences 
extended by a single sentence bead. The rows in the equation correspond to 1:0, 0:1, 1:1, 
2:1, and 1:2 sentence beads, respectively. The value $D(0,0)$ is taken to be zero. The value
$D(l(E), l(F))$ represents the shortest distance through the whole corpus, and it is possible to
recover the alignment corresponding to this shortest path through some simple bookkeeping. 
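The recurrence and the bookkeeping needed to recover the alignment can be sketched as follows. This is an unthresholded, quadratic illustration under stated assumptions: the function name `align` and the callback `cost(es, fs)`, which must return the negative log probability of the sentence bead built from the English sentences `es` and French sentences `fs`, are hypothetical stand-ins for the model described above.

```python
import math

# Sentence bead shapes: 1:0, 0:1, 1:1, 2:1, 1:2 (English count, French count).
MOVES = [(1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]

def align(english, french, cost):
    """Fill the lattice D(i, j) and recover the shortest path as a list of beads."""
    n, m = len(english), len(french)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in MOVES:
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or D[pi][pj] == math.inf:
                    continue
                d = D[pi][pj] + cost(english[pi:i], french[pj:j])
                if d < D[i][j]:
                    D[i][j] = d
                    back[i][j] = (pi, pj)  # remember which bead ended here
    # Follow the back-pointers from the end of the corpus to its beginning.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((english[pi:i], french[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return D[n][m], beads
```

Row-by-row filling is valid here because every predecessor cell has a smaller `i` or a smaller `j`; the diagonal order described in the text is what makes thresholding possible.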

Intuitively, this search can be viewed as maintaining a set of partial alignments and 
extending them incrementally. We fill in the lattice in increasing diagonals, where the kth 
diagonal consists of all cells $D(i,j)$ such that $i + j = k$; the $k$th diagonal corresponds to
alignments containing a total of k sentences. Each cell D(i,j) corresponds to the most 
probable alignment ending at the ith and jth sentence in the English and French corpora. 
We can consider the alignments corresponding to the D(i,j) in the current diagonal i+j = k 
to be the set of current partial alignments. Filling in the lattice in increasing diagonals can 
be considered as extending the current partial alignments incrementally. 

Notice that this algorithm is quadratic in the number of sentences in the bilingual 
corpora, as we need to fill in a lattice with l(E)l(F) cells. Given the size of existing bilingual 
corpora and the computation necessary to evaluate the probability of a sentence bead, a 
quadratic algorithm is too profligate. However, we can reap great savings in computation 
through intelligent thresholding. Instead of evaluating the entire lattice D(i,j), we ignore 
parts of the lattice that look as if they correspond to poor alignments. By considering only 
a subset of all possible alignments, we reduce the computation to a linear one. 

More specifically, we notice that the length D(i,j) of an alignment prefix is proportional 
to the number of sentences in the alignment prefix i + j. Hence, it is reasonable to compare 
the lengths of two partial alignments if they contain the same number of sentences. We 
prune all alignment prefixes that have a substantially lower probability than the most 
probable alignment prefix of the same length. That is, whenever $D(i,j) > D(i',j') + c$
for some $i', j'$ and constant $c$ where $i + j = i' + j'$, we set $D(i,j)$ to $\infty$. This discards
from consideration all alignments that begin by aligning the first i English sentences with 
the first j French sentences. We evaluate the array D(i,j) diagonal by diagonal, so that 
i + j increases monotonically. For c, we use the value 500, and with the Hansard corpus 
this resulted in an average search beam width through the dynamic programming lattice of 
about thirty; that is, on average we evaluated thirty different D(i,j) such that i + j = k for 
each value k. 
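The thresholded search can be sketched by filling the lattice one diagonal at a time and discarding cells that fall more than $c$ behind the best cell on the same diagonal. The function name `beam_distance` and the callback `cost(i0, i1, j0, j1)`, assumed to return the negative log probability of the bead covering English sentences `i0..i1-1` and French sentences `j0..j1-1`, are hypothetical.

```python
import math

MOVES = [(1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]

def beam_distance(n, m, cost, c=500.0):
    """Return (D(n, m), number of cells evaluated) under diagonal thresholding."""
    D = {(0, 0): 0.0}  # pruned cells are simply absent, i.e. treated as infinity
    evaluated = 0
    for k in range(1, n + m + 1):
        diagonal = {}
        for i in range(max(0, k - m), min(n, k) + 1):
            j = k - i
            best = math.inf
            for di, dj in MOVES:
                prev = (i - di, j - dj)
                if prev in D:  # only extend alignments that survived thresholding
                    best = min(best, D[prev] + cost(prev[0], i, prev[1], j))
            if best < math.inf:
                diagonal[(i, j)] = best
                evaluated += 1
        if diagonal:
            d_best = min(diagonal.values())
            for cell, d in diagonal.items():
                if d <= d_best + c:  # keep cells within c of the best on this diagonal
                    D[cell] = d
    return D.get((n, m), math.inf), evaluated
```

With `c=math.inf` nothing is pruned and every cell of the lattice is evaluated; with a finite `c` the number of evaluated cells per diagonal stays small whenever one alignment is clearly better than the rest, which is the source of the near-linear behavior described above.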
4.3.8 Deletion Identification



The dynamic programming framework described above can handle the case when there 
is a small deletion in one of the corpora of a bilingual corpus. However, the framework 
is ineffective for deletions larger than hundreds of sentences; the thresholding mechanism 
is unreliable in this situation. When the deletion point is reached, the search algorithm 
will attempt to extend the current partial alignments with sentence beads that align unre- 
lated English and French sentences. There will be no correct alignment with significantly 
higher probability to provide a meaningful standard with which to threshold against. Any 
thresholding that occurs will be due to the random variation in alignment probabilities. 

One solution is to not threshold when a deletion is detected. However, this is also 
impractical since dynamic programming is quadratic without thresholding and deletions 
can be many thousands of sentences long. Thus, we handle long deletions outside of the 
dynamic programming framework. 

To detect the beginning of a deletion, we use the confluence point mentioned in Section
4.3.4 as an indicator. Recall that the confluence point is the point at the end of the longest
partial alignment common to all current hypothesis alignments.¹³ The distance (in sentences)
from the confluence point to the diagonal in the lattice $D(i,j)$ currently being filled
in can be thought of as representing the uncertainty in alignment at the current location 
in the lattice. In the usual case, there is one correct alignment that receives vastly greater 
probability than other alignments, and thresholding is very aggressive so this distance is 
small. However, when there is a large deletion in one of the parallel corpora, consistent 
lexical correspondences disappear so no one alignment has a much higher probability than 
the others. Thus, there will be little thresholding and the distance from the confluence 
point to the frontier of the lattice will become large. When this distance reaches a certain 
value, we take this to indicate the beginning of a deletion. 

In thresholding with c = 500 on the Hansard corpus, we have found that the confluence 
point is typically about thirty sentences away from the frontier of D(i,j). Whenever the 
confluence point lags 400 sentences behind the frontier, we assume a deletion is present. 

To identify the end of a deletion, we search for the occurrence of infrequent words 
that are mutual translations. We search linearly through both corpora simultaneously. All 
occurrences of words whose frequency is below a certain value are recorded in a hash table; 
with the Hansard corpus we logged words occurring ten or fewer times previously. Whenever 
we notice the occurrence of a rare word $e$ in one corpus and its translation $f$ in the other
(i.e., $p_b([e,f]) > 0$), we take this as a candidate location for the end of the deletion.

13 To calculate the confluence point, we keep track of all sentence beads currently belonging to an active
partial alignment. Whenever a sentence bead becomes the only active bead crossing a particular diagonal
in the distance lattice $D$, i.e., the only active sentence bead $[E_{i_1}^{i_2}, F_{j_1}^{j_2}]$ such that $i_1 + j_1 < k$ and $i_2 + j_2 > k$
for some $k$, then we know all active partial alignments include that sentence bead and we can move the
confluence point ahead of that sentence bead.

To give an example, assume that the current confluence point is located after the $i$th
English sentence and $j$th French sentence, and that we are currently calculating the diagonal
in $D$ consisting of alignments containing a total of $i + j + 400$ sentences. Since this frontier
is 400 sentences away from the confluence point, we assume a deletion is present. We
iterate through the English and French corpora simultaneously starting from the $i$th and
$j$th sentences, respectively, logging all rare words. Suppose the following rare words occur:



language    sentence #    word
English     i + 7         Socratic
English     i + 63        epidermis
French      j + 127       indemnisation
English     i + 388       Topeka
French      j + 416       gypsophile
English     i + 472       solecism
French      j + 513       socratique


When we reach the 513th French sentence after the confluence point, we observe the word 
socratique which is a translation of the word Socratic found in the 7th English sentence 
after the confluence point. (We assume that we have $p_b([\text{Socratic}, \text{socratique}]) > 0$.) We
then take the (i + 7)th and (j + 513)th English and French sentences to be a candidate 
location for the end of the deletion. 
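The rare-word scan in this example can be sketched as below. The function name, the event tuples, and the use of plain dictionaries in place of the hash table are assumptions; `translations` stands for the set of word pairs $(e, f)$ with $p_b([e,f]) > 0$, and offsets are counted from the confluence point.

```python
def first_candidate(events, translations):
    """Return (english_offset, french_offset) of the first rare-word match, or None.

    events: (language, sentence_offset, word) tuples in the order encountered.
    translations: set of (english_word, french_word) pairs with nonzero p_b.
    """
    by_english, by_french = {}, {}
    for e, f in translations:
        by_english.setdefault(e, set()).add(f)
        by_french.setdefault(f, set()).add(e)

    seen = {"English": {}, "French": {}}  # rare word -> offset of first occurrence
    for language, offset, word in events:
        if language == "English":
            partners, other = by_english.get(word, ()), seen["French"]
        else:
            partners, other = by_french.get(word, ()), seen["English"]
        for partner in partners:
            if partner in other:
                if language == "English":
                    return offset, other[partner]
                return other[partner], offset
        seen[language].setdefault(word, offset)
    return None
```

Run on the table above with `translations = {("Socratic", "socratique")}`, the scan logs the first six words without a match and returns `(7, 513)` when socratique is reached, which corresponds to the candidate location named in the text.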

We test the correctness of a candidate location using a two-stage process for efficiency. 
First, we calculate the probability of the sentence bead composed of the two sentences 
containing the two rare words. If this is "sufficiently high," we then examine the forty 
sentences following the occurrence of the rare word in each of the two parallel corpora. We 
use dynamic programming to find the probability of the best alignment of these two blocks 
of sentences. If this probability is also "sufficiently high" we take the candidate location 
to be the end of the deletion. Because it is extremely unlikely that there are two very 
similar sets of forty sentences in a corpus, this deletion identification algorithm is robust. 
In addition, because we key off of rare words in searching for the end of a deletion, deletion 
identification requires time linear in the length of the deletion. 

We consider the probability $p_{\text{actual}}$ of an alignment to be "sufficiently high" if its score
is a certain fraction $f$ of the highest possible score given just the English sentences in the
segment. More specifically, we use the following equation to calculate $f$:

$$f = \frac{(-\ln p_{\text{actual}}) - (-\ln p_{\min})}{(-\ln p_{\max}) - (-\ln p_{\min})}$$

To calculate $p_{\max}$, we calculate the French sentences that would yield the highest possible
alignment score given the English sentences in the alignment; these sentences can be
constructed by just taking the most probable word-to-word translation for each word in the
English sentences. The probability of the alignment of these optimal French sentences with
the given English sentences is $p_{\max}$. The probability $p_{\min}$ is taken to be the probability
assigned to the alignment where the sentences are aligned entirely with 0:1 and 1:0 sentence
beads; this approximates the lowest possible achievable score.
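The quotient $f$ is most conveniently computed from log probabilities, since the probabilities themselves underflow for blocks of forty sentences. A minimal sketch, with a hypothetical function name and natural-log inputs assumed:

```python
def alignment_quality(ln_actual, ln_min, ln_max):
    """Fraction f: 0 when the alignment is no better than p_min, 1 when it reaches p_max."""
    numerator = (-ln_actual) - (-ln_min)
    denominator = (-ln_max) - (-ln_min)
    return numerator / denominator
```

For example, with $\ln p_{\min} = -200$ and $\ln p_{\max} = -100$, an alignment with $\ln p_{\text{actual}} = -143$ scores $f = 0.57$.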

We take this quotient $f$ to represent the quality of an alignment. For the initial sentence-
to-sentence comparison, we took the fraction 0.57 to be "sufficiently high." For the alignment
of the next forty sentences, we took the fraction 0.4 to be "sufficiently high." These
values were arrived at empirically, by trying several values on several deletion points in the
Hansard corpus and choosing the value with the best subjective performance. Reasonable
performance can be achieved with values within a couple tenths of these values; higher or
lower values may be used to improve alignment precision or recall.¹⁴

Because we key off of rare words in recovering from deletions, it is possible to overshoot 
the true recovery point by a significant amount. To correct for this, after we find a location 
for the end of a deletion using the mechanisms described previously, we backtrack through 
the corpus. We take the ten preceding sentences in each corpus from the recovery point, 
and find the probability of their alignment. If this probability is "sufficiently high," we 
move the recovery point back and repeat the process. We take the fraction 0.4 using the 
measure described in the last paragraph to be "sufficiently high." 

4.3.9 Subdividing a Corpus for Parallelization 

Sentence alignment is a task that seems well-suited to parallelization, since the alignments
of different sections of a bilingual corpus are basically independent. However, to parallelize
sentence alignment it is necessary to be able to divide a bilingual corpus into many sections 
accurately. That is, division points in the two corpora of a bilingual corpus must correspond 
to identical points in the text. Our deletion recovery mechanism can be used for this 
purpose. We start at the beginning of each corpus in the bilingual corpus, and skip some 
number of sentences in each corpus. The number of sentences we skip is the number of 
sentences we want in each subdivision of the bilingual corpus. We then employ the deletion 
recovery mechanism to find a subsequent point in the two corpora that align, and we take 
this to be the end of the subdivision. We repeat this process to divide the whole corpus 
into small sections. 

4.3.10 Algorithm Summary 

We summarize the algorithm below, not including parallelization. 



    initialize all counts and parameters

    ; bootstrap the model by training on the L manually-aligned sentence beads
    for i = 1 to L do
    begin
        b := most probable word beading of the i-th manually-aligned sentence bead
        update counts and parameters based on b
    end

    ; this is the main loop where we fill the D(i,j) lattice
    ; (i_conf, j_conf) holds indices of the English and French sentences at confluence point
    ; k holds value of current diagonal being filled; diagonal is all D(i,j) with k = i + j
    i_conf := 1
    j_conf := 1
    D(0,0) := 0

    for k = 2 to l(E) + l(F) do
    begin
        ; check for long deletion
        if k - (i_conf + j_conf) > 400 then
        begin
            (i_end, j_end) := location of end of deletion (see Section 4.3)
            k := i_end + j_end
            (i_conf, j_conf) := (i_end, j_end)
        end

        ; fill in diagonal in D array
        ; D_best holds the best score in the diagonal, used for thresholding purposes
        D_best := ∞

        ; only fill in cells in diagonal corresponding to extending a nonthresholded alignment
        for all i, j ≥ 1 such that i + j = k and
                ∃ i', j' with i - 2 ≤ i' ≤ i, j - 2 ≤ j' ≤ j, D(i',j') < ∞ do
        begin
            D(i,j) := value given by the recurrence for D(i,j)
            if D(i,j) < D_best then
                D_best := D(i,j)
        end

        ; threshold items in diagonal
        for all i, j ≥ 1 such that i + j = k and D(i,j) < ∞ do
            if D(i,j) > D_best + 500 then
                D(i,j) := ∞

        ; update confluence point (see Section 4.3)
        if confluence point has moved then
        begin
            (i_conf, j_conf) := new location of confluence point

            for each sentence bead moved behind confluence point do
            begin
                b := most probable word beading of the sentence bead
                update counts and parameters based on b
            end
        end
    end

* We use precision to describe the fraction of sentence beads returned by the algorithm that represent
correct alignments; recall describes the fraction of all correct sentence beads that are found by the algorithm.
In these measures, we only consider sentence beads containing both English and French sentences as these
are the beads most useful in applications.

    E1  If there is some evidence that it . . . and I will see that it does.
    E2  \SCM{} Translation \ECM{}

    F1  Si on peut prouver que elle . . . je verrais à ce que elle se y conforme. \SCM{} Language = French \ECM{}
    F2  \SCM{} Paragraph \ECM{}

Figure 4.4: An alignment error
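
The lattice-filling loop of the algorithm summary in Section 4.3.10 can be sketched in Python as below. This is a simplified illustration, not the implementation used in this work: it omits deletion recovery and confluence-point updates, and the `bead_cost` function (here a toy length-based cost built by the hypothetical helper `length_cost`) is an assumption standing in for the lexical translation model.

```python
import math
from typing import Callable, Dict, Tuple

# Sentence bead shapes (English count, French count): one sentence aligned to
# zero, one, or two sentences in the other language.
BEADS = [(1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]

def fill_lattice(n_e: int, n_f: int,
                 bead_cost: Callable[[int, int, int, int], float],
                 beam: float = 500.0) -> Dict[Tuple[int, int], float]:
    """Fill D(i, j) one diagonal (i + j = k) at a time with beam thresholding.

    Cells missing from the returned dict are treated as having infinite cost.
    """
    D = {(0, 0): 0.0}
    for k in range(1, n_e + n_f + 1):
        diagonal = {}
        for i in range(max(0, k - n_f), min(n_e, k) + 1):
            j = k - i
            best = math.inf
            for di, dj in BEADS:
                prev = D.get((i - di, j - dj), math.inf)
                if prev < math.inf:
                    best = min(best, prev + bead_cost(i - di, i, j - dj, j))
            if best < math.inf:
                diagonal[(i, j)] = best
        if diagonal:
            d_best = min(diagonal.values())
            # Threshold: keep only cells within `beam` of the best cell.
            for cell, cost in diagonal.items():
                if cost <= d_best + beam:
                    D[cell] = cost
    return D

def length_cost(e_lens, f_lens):
    """Hypothetical toy cost: length mismatch plus a penalty for non-1:1 beads."""
    def cost(i0, i1, j0, j1):
        mismatch = abs(sum(e_lens[i0:i1]) - sum(f_lens[j0:j1]))
        return float(mismatch) + (0.0 if (i1 - i0, j1 - j0) == (1, 1) else 5.0)
    return cost

cost = length_cost([10, 20, 30], [10, 20, 30])
full = fill_lattice(3, 3, cost)               # wide beam: nothing is pruned
pruned = fill_lattice(3, 3, cost, beam=0.0)   # narrow beam: off-path cells dropped
```

With a narrow beam, only cells near the best path survive on each diagonal, which is what keeps the work per diagonal roughly constant.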



4.4 Results 

Using this algorithm, we have aligned three large English/French corpora. We have aligned 
a corpus of 3,000,000 sentences (of both English and French) of the Canadian Hansards, 
a corpus of 1,000,000 sentences of newer Hansard proceedings, and a corpus of 2,000,000 
sentences of proceedings from the European Economic Community. In each case, we first 
bootstrapped the translation model by training on 100 previously aligned sentence pairs. 
We then trained the model further on 20,000 (unaligned) sentences of the target corpus. 

Because of the very low error rates involved, instead of direct sampling we decided to 
estimate our error on the old Hansard corpus through comparison with the alignment found 
by Brown et al. on the same corpus. We manually inspected over 500 locations where the 
two alignments differed to estimate our error rate on the alignments disagreed upon. Taking 
the error rate of the Brown alignment to be 0.6%, we estimated the overall error rate of our 
alignment to be 0.4%. 

In addition, in the Brown alignment approximately 10% of the corpus was discarded 
because of indications that it would be difficult to align. Their error rate of 0.6% holds on 
the remaining sentences. Our error rate of 0.4% holds on the entire corpus. Gale reports 
an approximate error rate of 2% on a different body of Hansard data with no discarding, 
and an error rate of 0.4% if 20% of the sentences can be discarded. 

Hence, with our algorithm we can achieve at least as high accuracy as the Brown and 
Gale algorithms without discarding any data. This is especially significant since, presum- 
ably, the sentences discarded by the Brown and Gale algorithms are those sentences most 
difficult to align. 

To give an idea of the nature of the errors our algorithm makes, we randomly sampled
300 alignments from the newer Hansard corpus. The two errors we found are displayed in
Figures 4.4 and 4.5. In the first error, E1 was aligned with F1 and E2 was aligned with
F2. The correct alignment maps E1 and E2 to F1 and F2 to nothing. In the second error,
E1 was aligned with F1 and F2 was aligned to nothing. The correct alignment maps E1
to both F1 and F2. Both of these errors could have been avoided with improved sentence
boundary detection.

    E1  Motion No. 22 that Bill C-84 be amended in . . . and substituting the following therefor : second anniversary of-

    F1  Motion No 22 que on modifie le projet de loi C-84 . . . et en la remplaçant par ce qui suit : ' 18.
    F2  Deux ans après : '.

Figure 4.5: Another alignment error

The rate of alignment ranged from 2,000 to 5,000 sentences of both English and French
per hour on an IBM RS/6000 530H workstation. Using the technique described in Section
4.3.9, we subdivided corpora into small sections (20,000 sentences) and aligned sections in
parallel. While it required on the order of 500 machine-hours to align the newer Hansard
corpus, it took only 1.5 days of real time to complete the job on fifteen machines.



4.4.1 Lexical Correspondences 

One of the by-products of alignment is the distribution p_b(b) of word bead frequencies. For
1:1 word beads b = [e, f], the probability p_b(b) can be interpreted as a measure of how
strongly the words e and f translate to each other. Hence, p_b(b) in some sense represents a
probabilistic word-to-word bilingual dictionary.

For example, we can use the following measure t(e, f) as an indication of how strongly
the words e and f translate to each other:*

\[ t(e, f) = \ln \frac{p_b([e, f])}{p_e(e)\, p_f(f)} \]

We divide p_b([e, f]) by the frequencies of the individual words to correct for the effect that
p_b([e, f]) will tend to be higher for higher frequency words e and f. Thus, t(e, f) should not
be skewed by the frequency of the individual words. Notice that t(e, f) is roughly equal to
the gain in the (logarithm of the) probability of a word beading if word beads [e] and [f]
are replaced with the bead [e, f]. Thus, t(e, f) dictates the order in which 1:1 word beads
are applied in the search for the most probable word beading of a sentence bead described
in Section 4.3.1.
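
As an illustration, the measure t(e, f) can be computed from raw counts as in the sketch below. The relative-frequency estimates of p_b, p_e, and p_f used here are an assumption made for the example, and the function name and toy counts are hypothetical.

```python
import math
from typing import Dict, Tuple

def translation_strength(bead_counts: Dict[Tuple[str, str], int],
                         e_counts: Dict[str, int],
                         f_counts: Dict[str, int]) -> Dict[Tuple[str, str], float]:
    """Compute t(e, f) = ln(p_b([e, f]) / (p_e(e) * p_f(f))) for each 1:1 bead."""
    total_beads = sum(bead_counts.values())
    total_e = sum(e_counts.values())
    total_f = sum(f_counts.values())
    scores = {}
    for (e, f), count in bead_counts.items():
        p_b = count / total_beads          # frequency of the word bead [e, f]
        p_e = e_counts[e] / total_e        # frequency of the English word
        p_f = f_counts[f] / total_f        # frequency of the French word
        scores[(e, f)] = math.log(p_b / (p_e * p_f))
    return scores

# Toy counts (invented for illustration only).
scores = translation_strength(
    bead_counts={("house", "maison"): 8, ("the", "la"): 2},
    e_counts={"house": 10, "the": 90},
    f_counts={"maison": 10, "la": 90},
)
```

In this toy example the rarer pair receives the higher score even though both beads were observed, which is the frequency correction the division is meant to provide.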

In Appendix A, for a group of randomly sampled English words we list the French words
that translate most strongly to them according to this measure. In general, the
correspondences are fairly accurate. However, for some common prepositions the correspondences are
rather poor; examples of this are also listed in Appendix A. Prepositions sometimes occur
in situations in which they have no good translation, and many prepositions have numerous
translations.†

* This measure is closely related to the measure

\[ MI_F(x, y) = \frac{p_{X,Y}(x, y)}{p_X(x)\, p_Y(y)} \]

referred to as mutual information by Magerman and Marcus (1990). This is not to be confused with the
more common definition of mutual information:

\[ I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \]

In many lists, the "French" word with the exact same spelling as the English word
occurs near the top of the list. (A word is considered to be "French" if it occurs at any
point in the French corpus.) This is due to the initialization that we perform for
cognates, described in Section 4.3.6.

Notice that we are capable of acquiring strong lexical correspondences between non- 
cognates. Preliminary experiments indicate that cognate initialization does not significantly 
affect alignment accuracy. Hence, our alignment algorithm is applicable to languages with 
differing alphabets. 



4.5 Discussion 

We have described an accurate, robust, and efficient algorithm for sentence alignment. 
The algorithm can handle large deletions in text, it is language independent, and it is 
parallelizable. It requires a minimum of human intervention; for each language pair 100 
sentences need to be aligned by hand to bootstrap the translation model. Unlike previous 
algorithms, our algorithm does not require that the bilingual corpus be predivided into 
small chunks or that one can identify markers in the text that make this subdivision easier. 
Our algorithm produces a probabilistic bilingual dictionary, and it can take advantage of 
cognate correspondences, if present. 

The use of lexical information comes at a great computational cost. Even with numerous
approximations, this algorithm is tens of times slower than the length-based algorithms of
Brown et al. and Gale and Church. This is acceptable given available computing power
and given that alignment is a one-time cost. It is unclear, though, whether more powerful
models are worth pursuing.

One limitation of the algorithm is that it only considers aligning a single sentence to zero, 
one, or two sentences in the other language. It may be useful to extend the set of sentence 
beads to include 2:2 or 1:3 alignments, for example. In addition, we do not consider sentence 
ordering transpositions between the two corpora. For example, the case where the first 
sentence in an English corpus translates to the second sentence in a French corpus and the 
second English sentence translates to the first French sentence cannot be handled correctly 
by our algorithm. At best, our algorithm will align one pair of sentences correctly and align 
each of the remaining two sentences to nothing. However, while extending the algorithm 
in these ways can potentially reduce the error rate by allowing the algorithm to express a 
wider range of alignments, it may actually increase error rate because the algorithm must 
consider a larger set of possible alignments and thus becomes more susceptible to random 
error. 

Thus, before adding extensions like these it may be wise to strengthen the translation
model, which should improve performance in general anyway. For example, one natural
extension to the translation model is to account for word ordering. Brown et al. (1993)
describe several such possible models. However, substantially greater computing power
is required before these approaches can become practical, and there is not much room for
further improvements in accuracy. In addition, parameter estimation becomes more difficult
with larger models.

† It has been suggested that removing closed-class words before alignment may improve performance;
these words do not provide much reliable alignment information and are often assigned spurious translations
by our algorithm (Shieber, 1996).

Chapter 5

Conclusion 



In this thesis, we have presented techniques for modeling language at the word level, the 
constituent level, and the sentence level. At each level, we have developed methods that 
surpass the performance of existing methods. 

At the word level, we examined the task of smoothing n-gram language models. While 
smoothing is a fundamental technique in statistical modeling, the literature lacks any sort 
of systematic comparison of smoothing techniques for language tasks. We present an ex- 
tensive empirical comparison of the most widely-used smoothing algorithms for n-gram 
language models, the current standard in language modeling. We considered several is- 
sues not considered in previous work, such as how training data size, n-gram order, and 
parameter optimization affect performance. In addition, we introduced two novel smooth- 
ing techniques that surpass all previous techniques on trigram models and that perform 
well on bigram models. We provide some detailed analysis that helps explain the relative 
performance of different algorithms. 

At the constituent level, we investigated grammar induction for language modeling. 
While yet to achieve comparable performance, grammar-based models are a promising al- 
ternative to n-gram models as they can express both short and long-distance dependencies 
and they have the potential to be more compact than equivalent n-gram models. We in- 
troduced a probabilistic context-free grammar induction algorithm that uses the Bayesian 
framework and the minimum description length principle. By using a rich move set and 
the technique of triggering, our search algorithm is efficient and effective. We demonstrated 
that our algorithm significantly outperforms the Lari and Young induction algorithm, the 
most widely-used algorithm for probabilistic grammar induction. In addition, we were able 
to surpass the performance of n-gram models on artificially-generated data. 

At the sentence level, we examined bilingual sentence alignment. Bilingual sentence
alignment is a necessary step in processing a bilingual corpus for use in many applications.
Previous algorithms suitable for large corpora ignore word identity information and
just consider sentence length; we introduce an algorithm that uses lexical information yet is
efficient enough for large bilingual corpora. Furthermore, our algorithm is robust,
language-independent, and parallelizable. We surpass all previously reported accuracy rates on the
Hansard corpus, the most widely-used corpus in machine translation research.

It is interesting to note that for these three tasks, we use three very different frameworks.
In the next section, we explain what these frameworks are and how they relate to the
Bayesian framework. We argue that each framework used is appropriate for the associated
problem. Finally, we show how our work on these three problems addresses two central issues
in probabilistic modeling: the sparse data problem and the problem of inducing hidden
structure.



5.1 Bayesian Modeling 

In our work on grammar induction, we use the Bayesian framework. We attempt to find 
the grammar G with the largest probability given the training data, or observations, O. 
Applying Bayes' rule we get 

\[ G = \arg\max_G p(G|O) = \arg\max_G \frac{p(G)\,p(O|G)}{p(O)} = \arg\max_G p(G)\,p(O|G) \tag{5.1} \]

We search for the grammar G that maximizes the objective function p(G)p(O|G).

The Bayesian framework is a very elegant and general framework. In the objective 
function, the term describing how well a grammar models the data, p(O|G), is separate from
the term describing our a priori notion of how likely a grammar is, p(G). Furthermore,
because we express the target grammar in a static manner instead of an algorithmic manner, 
we have a separation between the objective function and the search strategy. Thus, we can 
switch around prior distributions or search strategies without changing other parts of an 
algorithm. The Bayesian framework modularizes search problems in a general and logical 
way. 
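
The modularity described above can be illustrated with a small sketch in which the prior, the likelihood, and the search are separate pieces. Everything here is hypothetical and chosen for illustration: the candidate records, their description lengths and log-likelihoods, the particular form of the MDL-style prior, and the exhaustive search, which stands in for a real search strategy.

```python
import math
from typing import Any, Callable, Dict, List

Grammar = Dict[str, Any]  # hypothetical record: name, description length, log-likelihood

def mdl_log_prior(g: Grammar) -> float:
    """MDL-style prior: p(G) proportional to 2 ** (-description length in bits)."""
    return -g["bits"] * math.log(2)

def flat_log_prior(g: Grammar) -> float:
    """Uniform prior, which reduces the search to maximum likelihood."""
    return 0.0

def most_probable(candidates: List[Grammar],
                  log_prior: Callable[[Grammar], float],
                  log_likelihood: Callable[[Grammar], float]) -> Grammar:
    """Exhaustive search for argmax_G p(G) p(O|G), computed in log space."""
    return max(candidates, key=lambda g: log_prior(g) + log_likelihood(g))

# Invented candidates: larger grammars fit the data better but cost more bits.
candidates = [
    {"name": "small", "bits": 10, "loglik": -50.0},
    {"name": "medium", "bits": 30, "loglik": -25.0},
    {"name": "large", "bits": 100, "loglik": -20.0},
]
loglik = lambda g: g["loglik"]

with_mdl_prior = most_probable(candidates, mdl_log_prior, loglik)
with_flat_prior = most_probable(candidates, flat_log_prior, loglik)
```

Swapping the prior changes which grammar is selected without touching the search routine or the likelihood term.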

However, notice that we could have framed both n-gram smoothing and sentence alignment
in the Bayesian framework as well, but we chose not to. To express smoothing in the
Bayesian framework, we can use an equation analogous to equation (5.1), e.g.,

\[ M = \arg\max_M p(M|O) = \arg\max_M \frac{p(M)\,p(O|M)}{p(O)} = \arg\max_M p(M)\,p(O|M) \]

where M denotes a smoothed n-gram model. We can design a prior p(M) over smoothed
n-gram models and search for the model M that maximizes p(M)p(O|M).¹ Instead, most
existing smoothing algorithms as well as our novel algorithms involve a straightforward
mapping from training data to a smoothed model.²

1 Actually, a slightly different Bayesian formulation is more appropriate for smoothing, as will be
mentioned in Section 5.1.1.

2 This is not quite true; Jelinek-Mercer smoothing uses the Baum-Welch algorithm to perform a maximum
likelihood search for λ values. In addition, we performed automated parameter optimization in a maximum
likelihood manner.

In sentence alignment, from aligned data (E, F) we build a model p_t of the
frequency with which English sentences and French sentences occur as mutual translations in a
bilingual corpus. To frame this in the Bayesian framework, we can use the equation

\[ p_t = \arg\max_{p_t} p(p_t|E, F) = \arg\max_{p_t} \frac{p(p_t)\,p(E, F|p_t)}{p(E, F)} = \arg\max_{p_t} p(p_t)\,p(E, F|p_t) \]

We could devise a prior distribution p(p_t) over possible translation models and attempt
to find the model p_t that maximizes p(p_t)p(E, F|p_t). Instead, we use a variation of the
Expectation-Maximization algorithm to perform a deterministic maximum-likelihood search
for the model p_t.

Thus, while we could have used the Bayesian framework in each of the three problems 
we addressed, we instead used three different approaches. Below, we examine each problem 
in turn and argue why the chosen approach was appropriate for the given problem. 

5.1.1 Smoothing n-Gram Models 

First, we note that the Bayesian formulation given previously of finding the most probable
model

\[ M_{\text{best}} = \arg\max_M p(M|O) \]

is not appropriate for the smoothing problem. The most probable model M_best is
significant because we expect it to be a good model of data that will be seen in the future,
e.g., data that is seen during the actual use of an application. In other words, we find M_best
because we expect that p(O_f|M_best) is a good model of future data O_f. However, notice
that finding M_best is actually superfluous; what we are really trying to find is just p(O_f|O),
a model of what future data will be like given our training data. This is a more accurate
Bayesian formulation of the smoothing problem (and of modeling problems in general); the
identity of M_best is not important in itself.

Using this perspective, we can explain why smoothing is generally necessary in n-gram
modeling and other types of statistical modeling. Expressing p(O_f|O) in terms of models
M, we get

\[ p(O_f|O) = \sum_M p(O_f, M|O) = \sum_M p(O_f|M, O)\,p(M|O) = \sum_M p(O_f|M)\,p(M|O) \]

where the last step assumes that future data is independent of the training data given the
model. According to this perspective, instead of predicting future data using p(O_f|M_best) for just
the most probable model M_best, we should actually sum p(O_f|M) over all models M,
weighting each model by its probability given the training data. However, performing a sum over
all models is generally impractical. Instead, one might consider trying to approximate this
sum with its maximal term max_M p(O_f|M)p(M|O). However, the identity of this term
depends on O_f, data that has not been seen yet. Thus, just using p(O_f|M_good) for some
good model M_good, such as the most probable model M_best, may be the best we can do
in practice. Smoothing can be interpreted as a method for correcting this gap between
theory and reality, between p(O_f|O) and p(O_f|M_good). Viewed from this perspective, most
smoothing algorithms for n-gram models do not even use an intermediate model M_good, but
just estimate the distribution p(O_f|O) directly from counts in the training data.³
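
The difference between predicting with the single most probable model and summing over all models can be seen in a toy example. The sketch below is purely illustrative and is not drawn from the experiments in this work: it assumes a hypothetical model space of three biased coins with a uniform prior, and the function names are invented for the example.

```python
from typing import Dict, List

def posterior(models: List[float], prior: Dict[float, float],
              heads: int, tails: int) -> Dict[float, float]:
    """p(M|O) for coin models M (probability of heads) given observed flips O."""
    unnormalized = {m: prior[m] * (m ** heads) * ((1 - m) ** tails) for m in models}
    z = sum(unnormalized.values())
    return {m: w / z for m, w in unnormalized.items()}

def predictive_heads(post: Dict[float, float]) -> float:
    """p(next flip is heads | O) = sum over M of p(heads|M) p(M|O)."""
    return sum(m * w for m, w in post.items())

models = [0.1, 0.5, 0.9]
prior = {m: 1 / 3 for m in models}
post = posterior(models, prior, heads=3, tails=0)

averaged = predictive_heads(post)   # sum over all models
m_best = max(post, key=post.get)    # most probable model only
```

After three heads, the most probable model predicts heads with probability 0.9, while the sum over models gives a somewhat lower value because the less extreme models retain some posterior weight.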

We argue that the Bayesian approach is not attractive for the smoothing problem from a
methodological perspective. In the Bayesian framework, the nature of the prior probability
p(M) is the largest factor in determining performance. The distribution p(M) should reflect
how frequently smoothed n-gram models M occur in the real world. However, it is unclear
what a priori information we have pertaining to the frequency of different smoothed n-gram
models; we have little or no intuition on this topic. In addition, adjusting p(M) to try to
yield an accurate distribution p(O_f|O) is a rather indirect process.

For smoothing, it is much easier to estimate p(O_f|O) directly. We have insight into
what a smoothed distribution p(O_f|O) should look like given the counts in the training
data. For example, we know that an n-gram with zero counts should be given some small
nonzero corrected count, and that an n-gram with r > 0 counts should be given a corrected
count slightly less than r, so that there will be counts available for zero-count n-grams.
These corrected counts lead directly to a model p(O_f|O). In addition, detailed performance
analyses such as the analysis described in Section 2.5 lend themselves well to improving
algorithms that estimate p(O_f|O) directly.
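
The idea of corrected counts can be made concrete with a deliberately simple scheme. The sketch below is an assumption made for illustration (a fixed discount with uniform redistribution to unseen items); it is not one of the smoothing algorithms compared or introduced in this work, and the function name and discount value are hypothetical.

```python
from typing import Dict, List

def corrected_counts(counts: Dict[str, int], vocabulary: List[str],
                     discount: float = 0.4) -> Dict[str, float]:
    """Subtract a small discount from each nonzero count and share the freed
    mass equally among zero-count items. Assumes 0 < discount < 1 and that
    `vocabulary` contains every counted item."""
    seen = {w: c for w, c in counts.items() if c > 0}
    unseen = [w for w in vocabulary if counts.get(w, 0) == 0]
    if not unseen:
        return {w: float(c) for w, c in seen.items()}
    freed = discount * len(seen)
    corrected = {w: c - discount for w, c in seen.items()}
    for w in unseen:
        corrected[w] = freed / len(unseen)
    return corrected

smoothed = corrected_counts({"a": 3, "b": 1}, ["a", "b", "c", "d"])
total = sum(smoothed.values())
probabilities = {w: c / total for w, c in smoothed.items()}
```

The total count is preserved, so normalizing the corrected counts yields a distribution in which unseen items have small nonzero probability.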



There are several existing Bayesian smoothing methods (Nadas, 1984; MacKay and
Peto, 1995; Ristad, 1995), but none perform particularly well or are in wide use.



5.1.2 Bayesian Grammar Induction 

In grammar induction, we have a very different situation from that found in smoothing. In
smoothing, we have intuitions on how to estimate p(O_f|O), but little intuition of how to
estimate p(M). In grammar induction, we have the opposite situation. In this case, the
distribution p(O_f|O) has a very complex nature. In Section 3.2.3, we give examples that
hint at the complexity of this distribution. For example, if we see a sequence of numbers
in some text that happen to be consecutive prime numbers, people know how to predict
future numbers in the sequence with high probability. In Section 3.2.3, we explain how
complex behaviors such as this can be modeled by using the Bayesian abstraction. We take
the probability p(O) of some data O to be

\[ p(O) = \sum_{G_p} p(O, G_p) = \sum_{G_p} p(G_p)\,p(O|G_p) = \sum_{G_p :\, \text{output}(G_p) = O} p(G_p) \]

where G_p represents a program; that is, we view data as being the output of some
program. (The last step holds because p(O|G_p) is 1 if program G_p outputs O and 0 otherwise.)
By assigning higher probabilities to shorter programs, we get the desirable behavior
that complex patterns in data can be modeled. In the grammar induction task, we just
restrict programs G_p to those that correspond to grammars G. In this domain, we have
insight into the nature of the prior probability p(G), which takes a fairly simple form, while
p(O_f|O) takes a very complicated form. Thus, unlike smoothing, we find that the Bayesian
perspective is appropriate for grammar induction.



3 Alternatively, we can just view M_good as being the maximum likelihood n-gram model.

5.1.3 Bilingual Sentence Alignment

In sentence alignment, the situation is more similar to grammar induction than smoothing.
We have a complex distribution p(O_f|O), or using our alignment notation, p(E_f, F_f|E, F).
Instead of modeling p(E_f, F_f|E, F) directly, it is more reasonable to use the Bayesian
framework and to design a prior on translation models using the minimum description length
principle. This would yield some desirable behaviors; for example, in the description of
the full translation model we would include the description of the word-to-word translation
model p_b(b), which we can describe by listing all word beads b with nonzero probability.
A minimum description length objective function would favor models with fewer
nonzero-probability word beads, thus discouraging words from having superfluous translations in
the model.

In actuality, we decided to use a maximum likelihood approach, which is equivalent to 
using the Bayesian approach with a uniform prior over translation models. Specifically, we 
used a variation of the Expectation-Maximization (EM) algorithm, which is a hill-climbing 
search on the likelihood of the training data. However, we used an incremental variation 
of the algorithm. Typically, the EM algorithm is an iterative algorithm, where in each 
iteration the entire training data is processed. At the end of each iteration, parameters are 
re-estimated so that the likelihood of the data increases (or does not decrease, at least). In 
our incremental version, we make a single pass through the training data, and we re-estimate 
parameters after each sentence bead. Thus, we do not perform a true maximum likelihood 
search; we just do something roughly equivalent to a single iteration in a conventional EM 
search. 
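
The single-pass, incremental style of training described above can be sketched as follows. This is a heavily simplified illustration, not the translation model used in this work: it assumes only 1:1 word beads, scores a candidate bead by its raw count so far, picks beads greedily, and treats repeated words in a sentence as one; the function name and toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def incremental_train(sentence_pairs: Iterable[Tuple[List[str], List[str]]]) -> Counter:
    """Make one pass over aligned sentence pairs, updating counts after each pair."""
    counts: Counter = Counter()
    for e_words, f_words in sentence_pairs:
        # Rank candidate 1:1 word beads by the current parameters (counts so far).
        # The sort is stable, so ties keep their original order.
        candidates = [(e, f) for e in e_words for f in f_words]
        candidates.sort(key=lambda pair: counts[pair], reverse=True)
        used_e, used_f = set(), set()
        beading = []
        for e, f in candidates:
            if e not in used_e and f not in used_f:
                beading.append((e, f))
                used_e.add(e)
                used_f.add(f)
        # Re-estimate immediately, so the next pair sees the updated counts.
        counts.update(beading)
    return counts

pairs = [
    (["house"], ["maison"]),
    (["blue", "house"], ["maison", "bleue"]),
]
counts = incremental_train(pairs)
```

Because parameters are updated after every pair, what was learned from the first pair already guides the word beading of the second; a conventional EM iteration would instead hold the parameters fixed until all of the data had been processed.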

While we make this choice partially because a full EM search is too expensive computa- 
tionally, we also believe that a full EM search would yield poorer results. Notice that with 
each iteration of EM performed, the current hypothesis model fits closer to the training 
data. Recall that maximum likelihood models tend to overfit training data, as the uniform 
prior assigns too much probability to large models. Thus, additional EM iterations will 
eventually lead to more and more overfitting. By only doing something akin to a single 
iteration of EM, we avoid the overfitting problem.⁴

4 Better performance may be achieved by using held-out data to determine when overfitting begins, and
stopping the EM search at that point. However, additional EM passes are expensive computationally and
our single-pass approach performs well.

One way of viewing this is that we express a prior distribution over models procedurally.
Instead of having an explicit prior that prefers smaller models, we avoid overfitting through
heuristics in our search strategy. This violates the separation between the objective function
and the search strategy found in the Bayesian framework. On the other hand, it significantly
decreases the complexity of the implementation. In the Bayesian framework, one needs to
design an explicit prior over models and to perform a search for the most probable model;
these tasks are expensive both from a design perspective and computationally.
Furthermore, in sentence alignment it is not necessary to have a tremendously accurate translation
model, as lexical information usually provides a great deal of distinction between correct
and incorrect alignments. The use of ad hoc methods is not likely to decrease performance
significantly. Thus, we argue that the use of ad hoc methods such as procedural priors is
justified for sentence alignment as it saves a great deal of effort and computation without
loss in performance.

In summary, while the Bayesian framework provides a principled and effective approach 
to many tasks, it is best to adjust the framework to the exigencies of the particular problem. 
In grammar induction, an explicit Bayesian implementation proved worthwhile, while in 
sentence alignment less rigorous methods performed well. Finally, for smoothing we argue 
that non-Bayesian methods can be more effective. 



5.2 Sparse Data and Inducing Hidden Structure 

As mentioned in Chapter 1, perhaps the two most important issues in probabilistic modeling
are the sparse data problem and the problem of inducing hidden structure. The sparse data 
problem describes the situation of having insufficient data to train one's model accurately. 
The problem of inducing hidden structure refers to the task of building models that capture 
structure not explicitly present in the training data, e.g., the grammatical structure that we 
capture in our work in grammar induction. It is widely thought that models that capture 
hidden structure can ultimately outperform the shallow models that currently offer the best 
performance. In this thesis, we present several techniques that address these two central 
issues in probabilistic modeling. 



5.2.1 Sparse Data 

Approaches to the sparse data problem can be grouped into two general categories. The 
first type of approach is smoothing, where one takes existing models and investigates tech- 
niques for more accurately assigning probabilities in the presence of sparse data. Our work 
on smoothing n-gram models greatly forwards the literature on smoothing for the most 
frequently used probabilistic models for language. We clarify the relative performance of 
different smoothing algorithms on a variety of data sets, facilitating the selection of smooth- 
ing algorithms for different applications; no thorough empirical study existed before. In 
addition, we provide two new smoothing techniques that outperform existing techniques. 

The second general approach to the sparse data problem is the use of compact models.
Compact models contain fewer probabilities that need to be estimated and hence require less
data to train. While n-gram models are yet to be outperformed by more compact models,
this approach to the sparse data problem seems to be the most promising as smoothing
most likely will yield limited gains.⁵ As mentioned in Section 3.1.2, probabilistic grammars
offer the potential for achieving performance equivalent to that of n-gram models with much
smaller models. In our work on Bayesian grammar induction, we introduce a novel algorithm
for grammar induction that employs the minimum description length principle, under which
the objective function used in the grammar search process explicitly favors compact models.
In experiments on artificially-generated data, we can achieve performance equivalent to that
of n-gram models using a probabilistic grammar that is many times smaller, as shown in
Chapter 3. While unable to outperform n-gram models on naturally-occurring data, this
work represents very real progress in constructing compact models.

5 There has been some success in combining the two approaches. For example, Brown et al. (1992b) show
how class-based n-gram models can achieve performance near to that of n-gram models using far fewer
parameters. They then show that by linearly interpolating or smoothing conventional n-gram models with
class-based n-gram models, it is possible to achieve performance slightly superior to that of conventional
n-gram models alone.

5.2.2 Inducing Hidden Structure 

Models that capture the hidden structure underlying language have the potential to
outperform shallow models such as n-gram models. Not only does our work in grammar induction
forward research in building compact models as described above, it also demonstrates
techniques for inducing hidden structure. (Clearly, the problems of sparse data and inducing
hidden structure are not completely orthogonal, as taking advantage of hidden structure
can lead to more compact models.) In Chapter 3, we show how on artificially-generated
text our grammar induction algorithm is able to capture much of the structure present in
the grammar used to generate the text, demonstrating that our algorithm can effectively
extract hidden structure from data.

Our work in bilingual sentence alignment also addresses the problem of inducing hid- 
den structure. Given a raw bilingual corpus, sentence alignment involves recovering the 
hidden mapping between the two texts that specifies the sentence(s) in one language that 
translate to each sentence in the other language. To this end, our alignment algorithm also 
calculates a rough mapping between individual words in sentences in the two languages. 
Unlike in grammar induction, the model used to induce this hidden structure is a fairly 
shallow model; the model is just used to annotate data with the extracted structural
information. This annotated data can then be used to train structured models, as in work by
Brown et al. (1990).

In this work on hidden structure, we place an emphasis on the use of efficient algorithms. 
A major issue in inducing hidden structure is constraining the search process. Because the 
structure is hidden, it is difficult to select which structures to consider creating, and as a 
result many algorithms dealing with hidden structure induction are inefficient because they 
do not adequately constrain the search space. Our algorithms for grammar induction and 
bilingual sentence alignment are both near-linear; both are far more efficient than all other 
algorithms (involving hidden structure induction) that offer comparable performance. 

We achieve this efficiency through data-driven heuristics that constrain the set of hy- 
potheses considered in the search process and through heuristics that allow hypotheses to 
be evaluated very quickly. In grammar induction, we introduce the concept of triggers, 
or particular patterns in the data that indicate that the creation of certain rules may be 
favorable. Triggers reduce the number of grammars considered to a manageable amount. In 
addition, to evaluate the objective function efficiently, we use sensible heuristics to estimate 
the most probable parse of the training data given the current grammar and to estimate 
the optimal values of rule probabilities. 

In sentence alignment, we use thresholding to reduce the computation of the dynamic
programming lattice from quadratic to linear in data size. To enable efficient evaluation
of hypothesis alignments, we use heuristics to constrain which word beads have nonzero
probability. As mentioned in Section 4.3.5, limiting the number of such word beads greatly
simplifies the search for the most probable beading of a sentence bead. We believe that
data-driven heuristics such as the ones that we have employed are crucial for making hidden
structure induction efficient enough for large data sets.

In conclusion, this thesis represents a significant step toward addressing two central 
issues in probabilistic modeling: the sparse data problem and the problem of inducing 
hidden structure. We introduce novel techniques for smoothing and for constructing 
compact models, as well as novel and efficient techniques for inducing hidden structure. 
Appendix A 

Sample of Lexical Correspondences Acquired During Bilingual Sentence Alignment 



In this section, we list randomly-sampled lexical correspondences acquired during the 
alignment of 20,000 sentence pairs. We only list words occurring at least ten times. The 
numbers adjacent to the English words are the number of times those English words occurred 
in the corpus. The number next to each French word f is the value 

$$t(e, f) = \ln \frac{p_b([e, f])}{p_e(e)\, p_f(f)}$$

for the English word e above; this is a measure of how strongly the English word and French 
word translate to each other. 
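
As a small numerical illustration of this score, the following sketch evaluates t(e, f) for made-up probabilities. The function name `correspondence_score` and all probability values are hypothetical and are not taken from the alignment experiments.

```python
import math


def correspondence_score(p_bead, p_e, p_f):
    """Compute t(e, f) = ln(p_b([e, f]) / (p_e(e) * p_f(f)))."""
    return math.log(p_bead / (p_e * p_f))


# Hypothetical probabilities, chosen only for illustration.
paired = correspondence_score(p_bead=1e-4, p_e=1e-3, p_f=1e-3)
independent = correspondence_score(p_bead=1e-6, p_e=1e-3, p_f=1e-3)
print(paired)
```

Under this reading, a score near zero means the paired probability is about what the product of the two individual probabilities would give, and larger positive values indicate a stronger association between the two words.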

quality (27)
    qualite           11.69
    qualitatives      11.46
    eaux               9.52

form (70)
    forme             10.11
    trouveraient       8.72
    sorte              7.18
    obligations        6.69
    ajouter            6.49
    sous               5.03
    une                4.96
    avons              3.97

keeps (16)
    engagee            9.18
    continue           8.61
    que                4.66

houses (20)
    maisons           11.47
    chambres          10.77
    habitations        9.41
    maison             9.34
    domiciliaire       9.17
    logements          9.14
    parlementaires     8.01
    acheter            7.69

throughout (33)
    travers            9.51
    agriculteurs       8.58
    long               8.56
    toute              7.87
    dans               7.34
    organismes         7.28
    ou                 6.83
    durant             6.55
    moyens             5.98
    tout               5.53

enterprises (21)
    entreprises       10.21
    poursuite          8.92
    faciliter          8.87
    importance         6.78
    meme               4.94
    une                4.48

delivery (17)
    livraison, livraisons, modifiees, cependant, avant
    11.75, 10.57, 9.57, 7.72, 7.28, 6.37

steadily (14)
    constamment        8.21
    cesse              8.18

especially (25)
    surtout           11.20
    particulierement  10.99
    specialement      10.18
    particulier        9.55
    notamment          9.43
    precisement        7.65
    assurer            6.17
    qui                4.74

minimum (27)
    minimum           12.44
    minimal           12.43
    minimale          11.80
    minimaux          11.42
    minimums          11.19
    minimales         11.03
    etudiees           8.95
    moins              6.61
    jusque             5.87
    avec               4.87

appear (35)
    comparaitre        9.66
    semblent           9.36
    semble             8.92
    voulons            7.36
    frais              6.73
    -t-                5.26

stocks (17)
    stocks            11.75
    reserves           9.54
    bancs              9.51
    valeurs            8.75
    actions            8.19
    exercer            7.78

combined (13)
    joints             9.91
    deux               7.67

floor (30)
    plancher          11.61
    parole             9.42
    locataires         8.66
    chambre            7.45

losing (19)
    perdons           10.54
    perd               9.92
    perdre             9.43

new (134)
    new               11.1?
    nouveaux          11.10
    nouvelles         10.74
    nouvelle          10.70
    nouveau           10.54
    nouvel            10.30
    democrate          9.59
    neuves             9.46
    democrates         9.17
    neuf               8.64


A.1 Poor Correspondences 

In this section, we list some selected English words for which the acquired lexical 
correspondences are not overly appropriate. 

the (19379)
    the                7.42
    la                 6.56
    a                  5.65
    au                 5.23
    encourage          5.17
    cette              4.96
    prescrit           4.93
    profitable         4.92
    parenchymes        4.90
    tenu               4.86

on (1577)
    commenter          6.96
    au                 6.84
    devrai             6.70
    visites            6.63
    affreusement       6.43
    souleve            6.15
    attaquant          5.98
    vinicole           5.97
    victime            5.87
    ensuite            5.84

manner (43)
    f 51 r*f~»Ti lazuli     8.59
    in 51 mprp              8.97
    foi l rli pr IUUL11C1   7.06
    quoi                    O.U i
    avaient                 5.77
    m.                      5.76
    avec                    5.75
    destinees               5.70
    -t-                     5.43
    aussi                   4.99

at (2292)
    at                 7.66
    heures             6.89
    entreposes         6.60
    heure              6.39
    conformite         5.98
    rapatrie           5.95
    immobilises        5.93
    lors               5.89
    voyant             5.89
    respectifs         5.88

of (23032)
    of                 7.91
    rappel             6.63
    fermeture          5.56
    historiques        5.51
    des                5.12
    demanderais        4.87
    ordre              4.77
    entendu            4.77
    presider           4.64
    regne              4.63